Improvement for Convolutional Neural Networks in Image Classification Using Long Skip Connection

Abstract: In this paper, we examine the effect of long skip connections on convolutional neural networks (CNNs) for image (surface defect) classification. Popular standard models apply only short skip connections inside blocks (layers with the same size). We apply a long version of the residual connection to several proposed models, aiming to reuse the spatial knowledge lost from the layers close to the input. For some models, Depthwise Separable Convolution is used rather than traditional convolution in order to reduce both the number of parameters and the number of floating-point operations (FLOPs). Comparative experiments between the upgraded models and some popular models have been carried out on different datasets, including a Bamboo strips dataset and a reduced version of ImageNet. The modified version of DenseNet 121 (which we call MDenseNet 121) achieves higher validation accuracy while having about 75% of the weights and FLOPs of the original DenseNet 121.


Introduction
Wood in general, and bamboo in particular, has become one of the most popular materials today due to its environmental friendliness. Because of this popularity, product quality requirements have become more important and can be a crucial aspect of the industrial production line. Many studies in the field of defect detection and classification, with regard to digital image processing, aim to minimize or replace human vision and decision methodologies with artificial techniques [1]. Recent image processing-based methods could deal with defect classification at a decent performance level, but they were limited to detecting simple defects that are distinct from the background. Silvén, O. [2] used a self-organizing map (SOM) for discriminating between sound wood and defects. Qi, X. [3] proposed an algorithm that combines a blob analysis algorithm with an image preprocessing approach to detect defects. Haindl, M. [4] used a fast multispectral texture defect detection method based on the underlying three-dimensional spatial probabilistic image model. Xiansheng [5] provided an online method of bamboo defect inspection and classification. Wang, X. [6] proposed a new surface grading approach by integrating the color and texture of bamboo strips based on Gaussian multi-scale space.
Recent studies in the deep learning field have shown that most machine vision problems that were difficult for traditional computer vision methods can be solved using convolutional neural networks (CNNs). Krizhevsky et al. [7] introduced AlexNet, a CNN trained on the difficult ImageNet dataset, which solved the classification task with top-1 and top-5 error rates of 37.5% and 17.0%, far better than previous works. Simonyan and Zisserman [8] showed that CNNs need a deep network of layers in order for this hierarchical representation of visual data to work. However, stacking multiple convolution layers can cause the vanishing gradient problem, making the model difficult to train. ResNet [9] and Highway Networks [10] introduced the idea that the vanishing gradient problem could be tackled by bypassing the signal from layer to layer.

Related Works
Skip connections have become standard in today's convolutional architectures. Short skip connections are used along with consecutive convolutional layers that do not change the dimension inside blocks of layers, up to a convolution with a stride of two or a pooling layer, while long skip connections usually exist in architectures that are symmetrical.
Highway Networks used learned gating mechanisms (an early form of skip connection) to regulate information flow, inspired by Long Short-Term Memory (LSTM) [14] recurrent neural networks. The gating mechanisms allowed neural networks to have paths for information to follow across different layers ("information highways"), using and spreading knowledge from previous layers. The idea was revisited in ResNet: the deeper layers are identity mappings, and the other layers are learned from the shallower layers. Based on this idea, ResNet's authors created a deep residual learning architecture with 152 layers and won ILSVRC 2015 with an error rate of 3.6%. Another popular model that exploits skip connections is DenseNet. This architecture relies heavily on feature concatenation to ensure maximum information flow between layers in the network. This is achieved by connecting all layers directly with each other via concatenation, as opposed to the summation used in ResNet. DenseNet's authors claimed that their idea led to an enormous number of feature channels in the last layers of the network, to more compact models, and to extreme feature reusability. U-Net [15] performed long skip connections between layers in the encoder and decoder; while the encoder performed convolution, the decoder performed deconvolution (transposed convolution), mirroring the first path. By introducing skip connections in the encoder-decoder architecture, fine-grained details can be recovered in the prediction. Even though there was no theoretical justification, symmetrical long skip connections worked very effectively in dense prediction tasks (medical image segmentation).
To improve a model's computational performance, Depthwise Separable Convolution was first introduced in Xception as an Inception [16] module and placed throughout the deep learning architecture. Depthwise Separable Convolution consists of a Depthwise Convolution (a channel-wise $D_K \times D_K$ spatial convolution) followed by a pointwise convolution (a 1 × 1 convolution) to change the dimension. With $D_K \times D_K$ equal to 3 × 3, Depthwise Separable Convolution lowers the computational effort with only a small reduction in accuracy. MobileNet adopted the idea and built a network architecture that even outperforms SqueezeNet [17] and AlexNet with far fewer multiply-adds and parameters. MobileNet v2 [18] introduced two new features to the base architecture of the first version: linear bottlenecks between the layers, and shortcut connections between the bottlenecks. The latest MobileNet family member (v3) [19] used h-swish instead of the sigmoid function [20] and mobile-friendly Squeeze-and-Excitation blocks [21], together with automated model architecture search. The two later versions of the MobileNet family achieved about 75% accuracy and less than 5 ms latency (running on a Pixel 4 Edge TPU).
Recent model architectures have focused on balancing accuracy and latency; MnasNet [22] was the first to introduce an automated neural architecture search approach for designing mobile models using reinforcement learning. EfficientNet [23] uniformly scaled each dimension (width, depth and resolution) with a fixed set of scaling coefficients, achieving much better accuracy and efficiency than previous CNNs.

Methodology
Consider $x_0$ as the input image, passed through a convolutional network; $L$ is the number of layers in the network, while $H_l$ is the non-linear transformation of the $l$-th layer. The output of the $l$-th layer is $x_l$.

Skip Connection
Traditional feed-forward convolutional networks connect the output of the $l$-th layer as input to the $(l+1)$-th layer, which gives rise to the following layer transition: $x_l = H_l(x_{l-1})$. ResNet's authors added a skip connection that bypasses the non-linear transformations with an identity function, as shown in Figure 1:
$$x_l = H_l(x_{l-1}) + x_{l-1}$$
An advantage of ResNet is that the gradient can flow directly through the identity function from the early layers, close to the input, to the subsequent layers.
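The difference between the plain transition and the residual transition can be illustrated with a minimal sketch. It uses plain Python lists instead of feature maps, and `toy_transform` is a hypothetical stand-in for $H_l$ (it is not a layer from the paper); only the addition of the identity shortcut is the point here.

```python
from typing import Callable, List

Vector = List[float]


def plain_step(h: Callable[[Vector], Vector], x_prev: Vector) -> Vector:
    """Feed-forward transition: x_l = H_l(x_{l-1})."""
    return h(x_prev)


def residual_step(h: Callable[[Vector], Vector], x_prev: Vector) -> Vector:
    """Residual transition: x_l = H_l(x_{l-1}) + x_{l-1}."""
    out = h(x_prev)
    if len(out) != len(x_prev):
        raise ValueError("identity shortcut needs matching shapes")
    return [o + x for o, x in zip(out, x_prev)]


def toy_transform(x: Vector) -> Vector:
    """Hypothetical stand-in for H_l: scale by 0.5, then apply ReLU."""
    return [max(0.0, 0.5 * v) for v in x]


x0 = [1.0, -2.0, 4.0]
print(plain_step(toy_transform, x0))     # [0.5, 0.0, 2.0]
print(residual_step(toy_transform, x0))  # [1.5, -2.0, 6.0]
```

Note that the negative input survives the residual step even though the toy transformation zeroes it out, which is the sense in which the shortcut preserves information from earlier layers.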

Dense Connections
Huang et al. [11] introduced dense connections that connect every layer in a feed-forward direction. Unlike ResNet, DenseNet does not sum the output feature maps of a layer with the incoming feature maps but concatenates them instead. Consequently, the equation for the $l$-th layer is:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
where $[x_0, x_1, \ldots, x_{l-1}]$ is the concatenation of the output feature maps of layers $0, 1, \ldots, l-1$.
Since this grouping of feature maps cannot be done when their sizes are different, DenseNet is divided into Dense Blocks, where the dimensions of the feature maps remain constant within a block, but the number of filters changes between them. Huang et al. [11] also created Transition Layers, which sit between blocks and take care of the downsampling by applying a Batch Normalization (BN), a 1 × 1 convolution and a 2 × 2 pooling layer.
Another important feature of DenseNet is the growth rate. As defined in [11], the $l$-th layer has $k_0 + k \times (l - 1)$ input feature maps, where $k_0$ is the number of feature maps at the input layer and $k$ is the growth rate.
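The growth-rate formula can be checked with a short sketch that compares the closed form against a step-by-step simulation of concatenation. The values used (64 input maps, growth rate 32, six layers) are illustrative assumptions, and the function names are hypothetical.

```python
from typing import List


def dense_layer_input_channels(k0: int, k: int, num_layers: int) -> List[int]:
    """Closed form: layer l (1-based) receives k0 + k * (l - 1) feature maps."""
    return [k0 + k * (l - 1) for l in range(1, num_layers + 1)]


def simulate_concatenation(k0: int, k: int, num_layers: int) -> List[int]:
    """Track channel growth when each layer's k output maps are concatenated."""
    channels = k0
    seen = []
    for _ in range(num_layers):
        seen.append(channels)  # what the current layer receives as input
        channels += k          # its k new maps are appended for later layers
    return seen


# Illustrative values only: k0 = 64, growth rate k = 32, six layers.
print(dense_layer_input_channels(64, 32, 6))  # [64, 96, 128, 160, 192, 224]
```

Both functions return the same list, which shows why the input width of a dense block grows linearly with depth.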

Depthwise Separable Convolution
Srivastava et al. [10] and Huang et al. [11] proposed the Separable Convolution layer ("s-conv" for short), which first performs a Depthwise Spatial Convolution (which acts on each input channel separately), followed by a pointwise convolution that mixes the resulting output channels together.
A standard 2D convolutional layer takes a $D_H \times D_W \times M$ input feature map $I$ and produces a $D_h \times D_w \times N$ output feature map $O$, where $D_W$ and $D_H$ are the spatial width and height of the input feature map, $M$ is the input depth, $D_w$ and $D_h$ are the spatial width and height of the output feature map, and $N$ is the output depth.
A kernel $K$ of size $D_K \times D_K \times M \times N$ performs the convolution on the input feature map (with zero padding and stride one), where $D_K$ is the spatial dimension of the kernel (assumed to be square), $M$ is the number of input channels and $N$ is the number of output channels, as defined above. Normal convolutions have the computational cost:
$$D_K \cdot D_K \cdot M \cdot N \cdot D_W \cdot D_H$$
The computational cost depends multiplicatively on the number of input channels $M$, the number of output channels $N$, the kernel size $D_K \times D_K$ and the feature map size $D_W \times D_H$.
Depthwise Separable Convolution splits a kernel into two separate kernels that perform two convolutions: the Depthwise Spatial Convolution and the pointwise convolution. The Depthwise Spatial Convolution is the channel-wise $D_K \times D_K$ spatial convolution. The pointwise convolution is the 1 × 1 convolution that changes the dimension. Separable convolutions have the computational cost:
$$D_K \cdot D_K \cdot M \cdot D_W \cdot D_H + M \cdot N \cdot D_W \cdot D_H$$
Performing the division, the computational reduction is:
$$\frac{D_K \cdot D_K \cdot M \cdot D_W \cdot D_H + M \cdot N \cdot D_W \cdot D_H}{D_K \cdot D_K \cdot M \cdot N \cdot D_W \cdot D_H} = \frac{1}{N} + \frac{1}{D_K^2}$$
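As a worked check of this cost comparison, the sketch below counts multiply-accumulate operations for a standard and a depthwise separable convolution, assuming stride one and zero padding so that the output has the same spatial size as the input. The concrete layer sizes (3 × 3 kernel, 64 input channels, 128 output channels, 56 × 56 feature map) are illustrative assumptions, not values from the paper.

```python
from fractions import Fraction


def standard_conv_cost(d_k: int, m: int, n: int, d_w: int, d_h: int) -> int:
    """Cost of a standard convolution: D_K * D_K * M * N * D_W * D_H."""
    return d_k * d_k * m * n * d_w * d_h


def separable_conv_cost(d_k: int, m: int, n: int, d_w: int, d_h: int) -> int:
    """Cost of a depthwise (D_K x D_K) plus pointwise (1 x 1) convolution."""
    depthwise = d_k * d_k * m * d_w * d_h
    pointwise = m * n * d_w * d_h
    return depthwise + pointwise


def reduction_ratio(d_k: int, n: int) -> Fraction:
    """Closed-form ratio separable / standard = 1/N + 1/D_K^2."""
    return Fraction(1, n) + Fraction(1, d_k * d_k)


# Illustrative layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
std = standard_conv_cost(3, 64, 128, 56, 56)
sep = separable_conv_cost(3, 64, 128, 56, 56)
print(std, sep)                    # 231211008 27496448
print(Fraction(sep, std))          # 137/1152
print(reduction_ratio(3, 128))     # 137/1152
```

For a 3 × 3 kernel the ratio is dominated by the $1/D_K^2 = 1/9$ term, so the separable layer needs roughly one eighth to one ninth of the operations when $N$ is large.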

Long Skip Connections
DenseNet [11], U-Net [15] and V-Net [24] have shown that convolutional networks can be significantly deeper, more accurate, and easier to train if they contain shortcut connections between layers close to the input and those close to the output.
Inspired by these ideas, we first create bottlenecked blocks with skip connections, dense connectivity and a growth rate of 32, similar to DenseNet's architecture, and connect every block to each other in a feed-forward fashion. We add longer skip connections to pass features from upper layers (blocks) to lower layers (blocks). Generally, the $t$-th block receives the feature maps of all earlier blocks $n_0, n_1, \ldots, n_{t-1}$, where the $(t-1)$-th block has $l$ layers. We use a 3 × 3 separable convolution layer instead of a 3 × 3 normal convolution, as it reduces the number of parameters as well as the FLOPs, thus lowering the computational effort.

Bottlenecked Layers
It has been shown in [9,16] that a 1 × 1 convolution can be used as a bottleneck layer before each 3 × 3 convolution to reduce the number of input feature maps, thereby improving computational performance. We design our building blocks in this way, stacking a 1 × 1 conv, a 3 × 3 separable conv, and then another 1 × 1 conv, where the 1 × 1 convs reduce and restore the dimensions, while the 3 × 3 s-conv performs the convolution with lower input/output dimensions.
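To give a feel for the saving, the following sketch counts weights for a bottlenecked block (1 × 1 conv, 3 × 3 separable conv, 1 × 1 conv) and for a single plain 3 × 3 convolution with the same input and output width. Biases and batch normalization parameters are ignored, and the channel widths (256, 64, 256) are hypothetical values chosen for illustration.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise (kernel x kernel) plus pointwise (1 x 1) conv."""
    return kernel * kernel * c_in + c_in * c_out


def bottleneck_params(c_in: int, c_mid: int, c_out: int) -> int:
    """1x1 conv (reduce) -> 3x3 separable conv -> 1x1 conv (restore)."""
    reduce = conv_params(1, c_in, c_mid)
    spatial = separable_conv_params(3, c_mid, c_mid)
    restore = conv_params(1, c_mid, c_out)
    return reduce + spatial + restore


# Hypothetical widths: 256 channels in and out, 64 channels in the bottleneck.
print(bottleneck_params(256, 64, 256))  # 37440
print(conv_params(3, 256, 256))         # 589824
```

Under these assumptions the bottlenecked block uses well under a tenth of the weights of the plain 3 × 3 convolution, because the expensive spatial convolution only ever sees the reduced width.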

Upstream and Pooling Layers
An important function of convolutional networks is downsampling, which changes the size of the feature maps between layers. To deal with the difference in depth between blocks, we add a 1 × 1 conv to match the depth from block to block. UpSampling2D and MaxPooling2D layers are added on the connections between blocks, as shown in Figure 2.
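The shape matching described here can be sketched as a small planning function: given the shape of the source block and the shape of the destination block, it lists which resizing operation and which channel projection a long skip connection would need. This is only an assumed reading of the text; the function name, the order of operations and the string labels (borrowed from Keras layer names) are hypothetical, and no actual layers are built.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def _integer_factor(big: Tuple[int, int], small: Tuple[int, int]) -> int:
    """Return the common integer scale between two spatial sizes."""
    (bh, bw), (sh, sw) = big, small
    if bh % sh or bw % sw or bh // sh != bw // sw:
        raise ValueError("spatial sizes must differ by one integer factor")
    return bh // sh


def plan_long_skip(src: Shape, dst: Shape) -> List[Tuple[str, int]]:
    """List the operations needed so that `src` can be added to `dst`."""
    ops: List[Tuple[str, int]] = []
    src_hw, dst_hw = src[:2], dst[:2]
    if src_hw[0] > dst_hw[0]:
        ops.append(("MaxPooling2D", _integer_factor(src_hw, dst_hw)))
    elif src_hw[0] < dst_hw[0]:
        ops.append(("UpSampling2D", _integer_factor(dst_hw, src_hw)))
    elif src_hw != dst_hw:
        raise ValueError("heights match but widths differ")
    if src[2] != dst[2]:
        ops.append(("Conv2D_1x1", dst[2]))  # project to the target depth
    return ops


# Hypothetical shapes: an early 56x56x64 block feeding a later 14x14x256 block.
print(plan_long_skip((56, 56, 64), (14, 14, 256)))
# [('MaxPooling2D', 4), ('Conv2D_1x1', 256)]
```

The same function returns an `UpSampling2D` entry when the source is spatially smaller than the destination, and an empty list when the two shapes already agree.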
Appl. Sci. 2021, 11

M-DenseNet Architecture
Figure 3 illustrates the simple M-DenseNet architecture. Similar to DenseNet, we create dense connections between layers in each block, then add transition layer(s) to perform downsampling. However, we change all 3 × 3 conv layers in DenseNet to s-conv layers, then add Add blocks as shown in Table 1.
Table 1 shows the different configurations of M-DenseNet. Blocks are shown in brackets with their number of repetitions. UpSampling2D with a 1 × 1 Conv2D helps the Add blocks match dimensions. Before Block 1, a convolution with 64 output channels, a 7 × 7 kernel and a stride of 2 is applied to the input image. The block architecture, as shown in Figure 2, follows the idea of Deeper Bottleneck Architectures [11]. Add block (m-n) means that we create a long connection between block m and block n; block 0 is the output feature map of the first MaxPooling2D layer and block n is the output feature map of the n-th Transition Layer. The final layer is Dense (num_classes, SoftMax), where num_classes is the number of classes.
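The block indexing used by Add block (m-n) can be traced with a short sketch. It assumes a 224 × 224 input, a stride of 2 for the first MaxPooling2D layer, three Transition Layers, and that every Transition Layer halves the spatial size; these are assumptions in the style of DenseNet rather than values confirmed by Table 1, and the function names are hypothetical.

```python
from typing import List


def block_spatial_sizes(input_size: int = 224, num_transitions: int = 3) -> List[int]:
    """Spatial size of block 0, block 1, ... (index = block number)."""
    size = input_size // 2  # 7x7 convolution with stride 2
    size //= 2              # first MaxPooling2D (assumed stride 2): block 0
    sizes = [size]
    for _ in range(num_transitions):
        size //= 2          # each Transition Layer pools by a factor of 2
        sizes.append(size)
    return sizes


def add_block_pool_factor(m: int, n: int) -> int:
    """Downsampling factor needed to add block m onto block n (m < n)."""
    if not 0 <= m < n:
        raise ValueError("expected 0 <= m < n")
    return 2 ** (n - m)


print(block_spatial_sizes())        # [56, 28, 14, 7]
print(add_block_pool_factor(0, 2))  # 4
```

Under these assumptions, Add block (0-2) would have to pool the 56 × 56 output of block 0 by a factor of 4 to reach the 14 × 14 output of Transition Layer 2.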

Case Study on the Bamboo Strips Dataset
The Bamboo images were taken using a high-speed Area Color camera (BASLER-acA1920-150uc, ~200 frames per second) with a lens of 8 mm focal length (TAMRON), a frame grabber and a PC. The camera is fixed above the bamboo strip and focused on the surface. Because it is important to keep the lighting environment undisturbed, a square-shaped LED light is fixed above the bamboo strips. This type of light, with the addition of a black box as shown in Figure 4, is effective against reflection and shadow, as well as against disturbing light from the environment.
Typical bamboo strips have a width of 2 cm, so in order to improve production speed, we recorded images of five parallel strips on the conveyor. To deal with lens distortion, we calibrated the camera and obtained its matrix together with the distortion coefficients, with the help of the OpenCV library [25]. The images are then split into five equal parts (removing unwanted areas), each containing one bamboo strip with similar height and width, as described in Figure 5.
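The splitting step can be sketched as a computation of pixel bounds for equal strips. The source does not give the size of the full camera frame or of the trimmed region, so the 1300-pixel extent below is a hypothetical value, and `split_equal` is an illustrative helper rather than the authors' code.

```python
from typing import List, Tuple


def split_equal(start: int, stop: int, parts: int = 5) -> List[Tuple[int, int]]:
    """Return (begin, end) pixel bounds of `parts` equal strips in [start, stop)."""
    if parts <= 0 or stop <= start:
        raise ValueError("need parts > 0 and stop > start")
    step = (stop - start) // parts  # leftover pixels at the end are dropped
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


# Hypothetical usable region of 1300 pixels after trimming unwanted areas.
print(split_equal(0, 1300))
# [(0, 260), (260, 520), (520, 780), (780, 1040), (1040, 1300)]
```

Each pair of bounds can then be used to slice one strip out of the undistorted image along the axis that runs across the conveyor.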
We build a Bamboo strips dataset (about 25,000 images), whose images are classified manually. The dataset can be accessed at https://github.com/hieuth133/Bamboo (accessed on 24 January 2021). The image shape is 260 × 500 × 3 and the dataset contains seven classes: one for good bamboo strips, five classes containing images of non-qualified bamboo strips (detailed in Section 3.1), and one for the background (images of the conveyor). Examples of bamboo defect images are shown in Figure 6. Images are cropped to 224 × 224 × 3 with the per-pixel mean subtracted. Because the number of images containing defects is small compared to the number of good bamboo strip images, we upsample the defect classes using image augmentation. Using the Keras ImageDataGenerator library [26], we generate new images by horizontal and vertical flipping, rotating (±2 degrees), shifting the height and width, and changing the brightness. Both original and generated images are used together to form the Bamboo dataset. Figure 7 shows the distribution of the number of images across the seven classes in percentages.

Training Model
We evaluate M-DenseNet on our Bamboo strips dataset, as shown in Figure 8. We train for 30 epochs with a batch size of 32 using stochastic gradient descent (SGD), with a learning rate of $10^{-3}$ that is divided by 10 at epochs 10 and 20, a weight decay of $10^{-6}$, a momentum of 0.9, and Nesterov momentum applied. Training takes about 3 hours on a Tesla V100-SXM2 with 16 GB of VRAM and a compute capability of 7.0. Table 2 shows the results of the M-DenseNet, DenseNet and ResNet models trained on the Bamboo dataset. The accuracies of all models meet the requirement of the bamboo industry (≥95%), and M-DenseNet uses its parameters more efficiently than the other models, as shown in Figure 9; image predictions are shown in Figure 10. Model accuracy is defined as the average number of items correctly identified as either truly positive or truly negative out of the total number of items:
$$\text{Accuracy} = \frac{1}{k} \sum_{i=1}^{k} \frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i}$$
where $k$ is the number of classes, "t" is true, "f" is false, "p" is positive and "n" is negative.
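A minimal sketch of this accuracy measure is given below. It assumes the per-class average form, in which each class contributes $(tp + tn) / (tp + tn + fp + fn)$ and the result is the mean over the $k$ classes, and it assumes a confusion matrix whose rows are true classes and whose columns are predicted classes. The 3-class matrix is made-up toy data, not a result from the paper.

```python
from typing import List


def average_accuracy(confusion: List[List[int]]) -> float:
    """Mean over classes of (tp + tn) / (tp + tn + fp + fn)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for i in range(k):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                       # rest of row i
        fp = sum(confusion[r][i] for r in range(k)) - tp  # rest of column i
        tn = total - tp - fn - fp
        per_class.append((tp + tn) / (tp + tn + fp + fn))
    return sum(per_class) / k


# Toy confusion matrix (rows: true class, columns: predicted class).
toy = [
    [5, 1, 0],
    [0, 4, 1],
    [1, 0, 8],
]
print(round(average_accuracy(toy), 4))  # 0.9
```

In this toy case every class has 18 of 20 items counted as true positive or true negative, so the average is 0.9 even though only 17 of 20 items sit on the diagonal; the measure is therefore not the same as plain top-1 accuracy.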

Case Study on the Reduced Version of ImageNet
We also test the M-DenseNet and DenseNet families with and without long skip connections and s-conv. We use another dataset, a reduced version of ImageNet [27] with 100 classes and about 500 images in each class. The dataset is also accessible via the same GitHub link given above (Section 4.1.1). All models are trained with the same technique, with extra augmentation [28][29][30] and normalization [31]; the batch size is 32, the optimizer is Adam, the number of epochs is 60 and the learning rate is $10^{-3}$. Table 3 compares the accuracy and FLOPs of DenseNet and M-DenseNet. We observe that models with long skip connections converge faster, as shown in Figure 11, and have better overall results. The parameters and FLOPs of the M-DenseNet family are roughly 80% of those of the original DenseNet, while the accuracy increases by about 3-5%.

Numbers in bold represent the best result.
Testing usage with other models: We make modifications to other models, similar to those made to create M-DenseNet in Section 3.7. The models are EfficientNet [23] and MobileNet v1, v2, v3 [13,18,19]. We use the reduced version of ImageNet and the same training technique as explained in Section 4.2. The details of the comparison are given below. Figures 12 and 13 show that our modified models perform better than the original ones with a small decrease in FLOPs (3-6% for MobileNet and 1-2% for EfficientNet). This demonstrates the effectiveness of long skip connections for the performance of CNNs.

Ablation Study
We conducted an ablation study to show the effect of long skip connections on the performance of the models. We used EfficientNet B0 and three modified models based on EfficientNet B0 for testing.
The training configuration and dataset are the same as explained in Section 4.2. Figure 14 shows the validation loss while training the four models for 60 epochs. Models with more long skip connections perform better than those with fewer long skip connections.

Discussion
As noted above, the modification adds to the model and turns it into a larger version of DenseNet. The upgrade seems small, but it leads to notable consequences. We present a few points of discussion, together with the experiments above, to support the efficiency of our proposed method.
Feature reuse: Generally, the feature maps close to the input detect small or fine-grained details, whereas feature maps close to the output of the model capture more general features. The connections between blocks encourage layers to learn both the small details and the general nature of the object. The ablation study also shows the effect of skip connections: more connections result in better performance.
Adaptability: The method of adding long skip connections fits well with other models, from lightweight ones such as the MobileNet family to the well-designed EfficientNet family. Comparative experiments show that the modified models converge faster and have better performance.
Difficulty: In order to create a long skip connection between layers of different sizes, we have to equalize the shapes of the two layers (detailed in Section 3.6). This task can consume a large amount of GPU memory and may not work on GPUs with little memory, as the feature map of the long skip connection layer is very large.
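The memory pressure can be estimated with simple arithmetic. The sketch below counts only the bytes needed to hold one feature map for one batch, assuming 32-bit floats; it ignores gradients, optimizer state and framework overhead, and the 56 × 56 × 64 shape is a hypothetical example of an early-block feature map, not a measured value.

```python
def feature_map_bytes(batch: int, height: int, width: int, channels: int,
                      bytes_per_value: int = 4) -> int:
    """Bytes needed to store one activation tensor (float32 by default)."""
    return batch * height * width * channels * bytes_per_value


# Hypothetical early feature map: batch of 32, 56x56 spatial size, 64 channels.
size = feature_map_bytes(32, 56, 56, 64)
print(size)          # 25690112
print(size / 2**20)  # 24.5 (MiB)
```

Every long skip connection keeps such a tensor alive until the destination block is reached, so several connections from early, high-resolution blocks add up quickly during training.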

Conclusions
In this paper, we have proposed a modification that improves the performance of CNNs by exploiting long skip connections. It creates long connections between blocks of layers with different feature map sizes. In our experiments, the models with long skip connections tend to converge faster, without overfitting. The MDenseNet 121 model achieves higher validation accuracy while having about 75% of the weights and FLOPs of the original DenseNet 121. Adding long skip connections also helps the MobileNet and EfficientNet families to improve their performance (tested with a reduced version of ImageNet).
The models with long skip connections have benefits in comparison to their counterparts, as the connections enhance feature reuse throughout the models and encourage the models to learn both the fine details (coming from layers close to the input) and the more general details (coming from layers close to the output) of the objects. Moreover, the proposed modification is highly adaptable to many models, from lightweight to heavily parameterized ones. As a limitation, we experience difficulty in training big models, which is a result of the large shape of the feature map in the skip connection.
In further experiments, we will try to control the amount of knowledge passed when performing long skip connections. In future work, we will explore this architecture with deeper layers, while maintaining its performance, and apply it to different tasks such as segmentation or object detection.

Conflicts of Interest:
The authors declare no conflict of interest.