UPANets: Learning from the Universal Pixel Attention Networks

With the successful development of computer vision, building deep convolutional neural networks (CNNs) has become mainstream, owing to the parameter sharing in convolutional layers. Stacking convolutional layers into a deep structure improves performance, but over-stacking also ramps up the GPU resources needed. With the recent surge of Transformers in computer vision, the issue has become more severe. A resource-hungry model can hardly be deployed on limited hardware or on a single consumer-grade GPU. Therefore, this work focuses on these concerns and proposes an efficient but robust backbone equipped with channel and spatial attention, so that the attention helps to expand receptive fields in shallow convolutional layers and pass information to every layer. An attention-boosted network based on already efficient CNNs, Universal Pixel Attention Networks (UPANets), is proposed. Through a series of experiments, UPANets fulfil the purpose of learning global information with fewer resources and outperform many existing SOTA models on CIFAR-{10, 100}.


Introduction
The development of computer vision has experienced a range of trends in this decade. Several models introduced in open dataset competitions [1][2][3][4][5][6] significantly improved image classification accuracy, including deep convolutional neural networks (CNNs) with residual calculation [7][8][9][10][11][12][13][14]. By stacking convolutional layers, deep CNNs can capture local characteristics and global profiles through increasing receptive fields [15]. However, this deep policy raises the number of parameters needed, to the point that a single consumer-grade GPU cannot hold the model. Besides CNNs, the Vision Transformer (ViT) [16] has opened a path of applying pure Multi-Head Attention, which comes from natural language processing, to classify images by learning global information. While ViT has inspired further influential works [17,18], we face a more severe issue of draining GPUs than with deep CNNs, because most Transformer-based networks require more powerful GPUs with large dedicated CUDA memory. Although the sparse attention in Informer [19] tries to ease the burden on GPUs, training Transformer-based models on a consumer-grade GPU to reach decent performance remains impractical. This motivates this work to find a balance between computational cost and capturing image information globally.
To relieve the computational pressure on a GPU and achieve decent performance at the same time, equipping a shallow layer with a broad receptive field is critical. Namely, if we can make layers mature quickly (at an early depth), a deep structure is not necessary to increase receptive fields. Instead of choosing already power-hungry Transformer-based structures, endowing CNNs with the ability to learn global information is rational because of the filter-sharing mechanism in convolutional layers. The issue that needs to be addressed is then how to endow CNNs with this ability rather than stacking convolutional layers to increase the receptive field. This work proposes Channel Pixel Attention (CPA) to help convolutional layers obtain global information directly, as shown in Figure 1. With CPA, models can combine information across channels to generate more complex feature maps and make shallow layers behave similarly to deep layers.
Entropy 2022, 24, x FOR PEER REVIEW
In line with the same notion and the observations in [20], another direction to boost model learning and help information flow among convolutional layers is building connections within a block, which is usually a unit stacking multiple convolutional layers, and stacking such blocks into a layer module. To amplify the CPA effect, a hybrid connection with CPA is proposed. UPA blocks process lossless information by concatenating multiple UPA blocks in a stack and filter out vital information through a residual connection. With this operation, the received information comes not only from the last block but also from the information accumulated up to this block, so the CPA in each block can further absorb features from other blocks to amplify the receptive fields.
To transmit lossless learned information and feedback among layers, connecting each layer module is common in many object-detection tasks (e.g., saliency, semantic, and instance object detection) within an auto-encoder-based structure, in which the input and output share the same image size with a bottleneck in the network. Nevertheless, the same practice is not seen in image classification. Moreover, the effect of merely applying residual and concatenating connections to prevent information loss eventually saturates [21]. For this reason, the Extreme Connection (ExC) is proposed to connect each UPA layer module with learnable Spatial Pixel Attention (SPA) alongside the existing Global Average Pooling in UPANets. As a result, instead of simply extending the connection to classification, a learnable global pooling in the spatial direction, SPA, ensures that essential pixels occupy a significant portion of the information sent to the output layer. A smooth updating landscape is expected to be generated to ensure robustness. This paper first discusses the essential background and motivations in Section 1, and then introduces well-known observations and relevant works on image classification in Section 2. Next, the core of this work, the proposed UPANets and its structure, is introduced in Section 3. Then, to examine the proposed method, comparisons between UPANets and other recent methods on well-known datasets are presented in Section 4. Finally, the conclusion is in Section 5, and extra findings and simulations of UPANets are summarized in Appendices A-D. We share our implementation at https://github.com/hanktseng131415go/UPANets (accessed on 17 July 2022).

Attentions
Attention exists in many forms. In the Transformer [22], Multi-Head Attention serves the purpose of considering global information, which makes the model robust and powerful, but its heavy computational demand breaks the balance between performance and the number of parameters. Although ViT [16] used image patches to reduce the parameters needed, the problem remains. Later variants [17,18], with multiple Transformer-based units in a network, make the draining problem more severe. Rather than learning from a mega dataset to make the Transformer's attention useful, DeiT-B [23] used its proposed attention to transfer pre-trained parameters into a Transformer for image classification. However, this only relocates the draining issue from end-to-end training to knowledge distillation. Despite this dilemma, big companies such as Google, Microsoft, and Facebook continue to explore Transformers because of their ability to consider global information. Nonetheless, having a stack of GPUs is not common for most users, which drives the need to learn global information with fewer resources.
When attention is applied in CNNs, the draining issue is minor, but existing attention mechanisms mainly focus on local information or are limited to the current block. One of the best-known attention mechanisms in CNNs is the Convolutional Block Attention Module (CBAM) [24], which applies fixed max and average pooling to attend to pixels. Although the attention in CBAM is parameter-less, the risk of losing information through max and average pooling remains. Similarly, SENets [25] use global average pooling to squeeze the spatial information into one representative value and then use a multi-layer perceptron (MLP) with a ReLU and a sigmoid to produce channel attention. Embedding an SE-Block after each block showed improvements in VGGs [2], InceptionNets [3,4], and ResNeXts [26], but the same issue of losing information through fixed average pooling remains. In object detection, operating on a convolutional output to serve the purpose of pixel attention is also a trend. For example, DANet [27] is embedded behind a backbone network used as a feature extractor (e.g., ResNets [5]) and applies two dot products with Softmax on the output feature maps in its attention modules, including the Channel Attention Module. However, these modules only consider the feature maps from the final layer module. Additionally, GCNet [28] uses a module similar to DANet's in each block and replaces the dot-product operations with a 1 × 1 convolutional kernel, namely paying attention to the current block. In our comparison in Appendix B, the 1 × 1 convolutional kernel used in GCNet did not outperform a one-layer perceptron as a cross-channel attention operation.
Another way of learning across channels in the current block is the shuffle operation from ShuffleNets v1 [29] and v2 [30]. Shuffling the order of the kernel weights across groups breaks the independent learning process of group convolutions, so the next group-convolution layer can see the feature maps of the other groups. Nonetheless, the benefit of the shuffle operation is limited; see Section 4.2.2. Unit Subtraction Convolution (USC) in RK-Net [31], similar to the parameter-free operation in ShuffleNets, replaces the conventional dot product with subtraction to extract key points from the feature map. Another work that replaces the traditional convolutional operation is the Local Pattern Network (LPN) [32], which proposes a feature partition strategy to take advantage of contextual features with a parameter-free operation. ShuffleNets, USC, and LPN show that enriching the CNN operation with different mechanisms helps the network consider more information and perform better. In sum, inheriting a similar notion, our CPA brings a new direction for paying attention across channels. The resulting mature feature maps are further shown in Section 4 to perform better than the 1 × 1 convolutional layer, the shuffle operation, and SENet.

Structure Design
A good structure design affects performance and the parameter conversion ratio because it helps information transmit to different layers properly. ResNets introduced the residual connection, which offers a path for deep learning to fulfil the true meaning of deep, namely letting the original information pass to deep layers intact. Additionally, the residual connection reduces the risk of overfitting. The visualization of the loss landscape in [20] supports these underlying reasons. Another structure examined in [20] is the dense connection in DenseNets [33]. The dense connection links original and output information by reusing feature maps in deep layers. According to the landscapes of DenseNets in [20], their loss landscapes are smoother than those of ResNets. However, according to EfficientNet [21], the effect saturates whether skip connections or dense connections are used. Most importantly, in our simulation in Section 4.3, the efficiency between parameters and accuracy degrades severely as the layers in both ResNets and DenseNets grow. In our comparison in Section 4.3, UPANets has a better conversion ratio than both.

UPANets
This section details the proposed method, UPANets, and its components. First, the channel-wise attention approach is presented. Next, the hybrid UPA Block, with residual and concatenation learning, shows how the two work together in UPANets. After the proposed SPA in ExC is introduced, the structure of UPANets is shown.

Channel Pixel Attention
A convolutional kernel is good at capturing local information, but each kernel can only detect a specific pattern, which limits its ability. Conventionally, stacking convolutional layers to expand the pattern library is intuitive but costly, as discussed in Section 1. To expand the library without over-stacking layers, the ability to learn global information is vital. Therefore, Channel Pixel Attention (CPA) is proposed. CPA applies a one-layer perceptron to pay attention to the pixels at the same position across channels. By fusing patterns across channels, the library can be expanded and the detected patterns can become more complex; see Figure 1. The operation is given by Equation (1), where $c$ indicates the $c$th channel, $X \in \mathbb{R}^{N \times C \times W \times H}$, and $x^{R}_{c} \in \mathbb{R}^{N \times W \times H \times C}$ is the input reshaped to perform a dot product with $W^{T}_{c} \in \mathbb{R}^{N \times C \times C}$. After the pixel attention is processed by the one-layer perceptron, Batch Normalization and Layer Normalization with a residual connection are applied. The workflow of CPA is shown in Figure 2, and sample feature maps for an actual input image are shown in Figure 1, in which the feature maps output by CPA combine their original features with supportive information from other channels. These combined features show that CPA lets feature maps fuse into more complex ones without losing their original features. Compared with a deep structure, CPA helps a shallow network form more complex patterns, expanding receptive fields.
Figure 2. Channel Pixel Attention Structure. The green and blue lines represent the original and processed information, respectively. In the stride-one stage, one CPA is involved with a concatenation. In the stride-two stage, a parallel CPA is processed with residual learning to decide which essential information to pass, and then down-sampling is applied by avgpool2d.
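The channel-wise operation described above can be sketched in a few lines of NumPy. This is only an illustrative reading of the text, not the authors' implementation: the function name `channel_pixel_attention` is hypothetical, the weight is assumed to be a single shared $C \times C$ matrix, and the Batch Normalization and Layer Normalization steps are omitted so that only the per-pixel perceptron and the residual connection remain.

```python
import numpy as np

def channel_pixel_attention(x, weight):
    """Hypothetical CPA sketch. x: (N, C, W, H); weight: (C, C), shared by all pixels."""
    # Move channels last so every spatial position holds a length-C vector.
    x_r = np.transpose(x, (0, 2, 3, 1))               # (N, W, H, C)
    # One-layer perceptron across channels at each spatial position.
    attended = x_r @ weight.T                         # (N, W, H, C)
    # Restore the original layout.
    attended = np.transpose(attended, (0, 3, 1, 2))   # (N, C, W, H)
    # Residual connection keeps the original features (normalization omitted here).
    return x + attended

x = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
out = channel_pixel_attention(x, np.eye(3))
print(out.shape)  # (2, 3, 4, 5)
```

Because the weight only mixes channels, the spatial size is unchanged, and each output channel is a learned combination of all input channels at the same pixel.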

UPA Blocks
As discussed in Section 2 for ResNets and DenseNets, it is crucial that a block not only processes image information well but also passes on lossless features, so that the information processed by previous blocks is not wasted. To achieve that, a hybrid combination is proposed that collects the feature maps from the previous blocks by concatenation and filters the essential feature maps for the next layer by residual learning. Concatenation preserves the original information and further amplifies the CPA effect, since CPA learns cross-channel information not only in the current block but also among UPA layer modules (see Section 3.3). The UPA Block structure is shown in Figure 3.
As Figures 2 and 3 show, the difference between stride one and stride two is whether the concatenation operation is used. In addition, the residual connection applied in CPA determines whether to output the currently learned information or that from the last block.

UPA Layer Modules
Continuing the discussion in Section 3.2, a model can preserve processed information from previous blocks by applying concatenation in the UPA Block. Combining multiple UPA Blocks into a layer module then lets CPA pay attention to the vital pixels across channels from multiple blocks. Namely, through CPA, CNNs can access every processed feature in the layer module. When down-sampling is required, a parallel residual CPA decides which important pixels from the accumulated feature maps to pass. As shown in Figure 4, the UPA Layer Module helps CPA pay attention across channels throughout the accumulating blocks.
Figure 4. In UPA block 0, a stride-two UPA block using the residual connection with 2 × 2 kernel average pooling is applied.
In Figure 4, except for the stride-two operation in block 0, each block follows the stride-one operation. Further, the width of every stride-one block is smaller than its input width, as given by Equation (2), where $b = 1, \dots, n$; $W_l$ is the total width accumulated in this layer (the width being the number of filters or channels); $w_b$ is the output width of block $b$; and $w_0$ equals double the width of the last layer, because the original input remains and the processed information is appended after it. For example, if the width of layer module 1 is set to 16, the output width of layer module 1 will be 32 because of concatenation. Therefore, the width of block 0 in layer module 2 is 32, $w_0 = 32$. Then, as the number of blocks in layer module 2 is 4, $n = 4$, the width of each block is 8, $w_b = 8$, because $w_0 = 32$ and $32 / 4 = 8$. In this case, the output width after this UPA Block in the current UPA Layer Module will be $32 + 8 = 40$.
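The width arithmetic in the example above can be checked with a short Python sketch. It assumes, based only on the worked numbers in the text, that each stride-one block has width $w_b = w_0 / n$ and that every block concatenates its $w_b$ channels onto the running input; the helper name `upa_layer_widths` is hypothetical, and the widths after the first block are an extrapolation of that assumption.

```python
def upa_layer_widths(w0, n_blocks):
    """Return the per-block width and the running width after each stride-one block."""
    if w0 % n_blocks != 0:
        raise ValueError("this sketch assumes w0 is divisible by n_blocks")
    w_b = w0 // n_blocks
    # Assumed: each block appends w_b channels to what it received.
    running = [w0 + w_b * b for b in range(1, n_blocks + 1)]
    return w_b, running

w_b, running = upa_layer_widths(32, 4)
print(w_b, running)  # 8 [40, 48, 56, 64]
```

The first running value, 40, matches the $32 + 8 = 40$ stated in the text.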

Spatial Pixel Attention
Although Global Average Pooling (GAP) does not require extra computational cost, it risks losing information because it arbitrarily averages out all spatial information. To ease this concern, this work proposes Spatial Pixel Attention (SPA) with learnable parameters, which applies a one-layer perceptron to learn the essential pixels along the spatial direction. With this learnable process, SPA helps to determine which pixels should be amplified or ignored. The SPA mechanism is defined by Equation (3), where $c$ indicates the $c$th channel, $X \in \mathbb{R}^{N \times C \times 1}$, $x^{R}_{c} \in \mathbb{R}^{N \times C \times L}$, $L = W \times H$, and $W^{T}_{c} \in \mathbb{R}^{N \times L \times 1}$. In Figure 5, the process from (b) to (c) is implemented by a one-layer perceptron. Through this layer, SPA can pay appropriate attention to the essential pixels and then squeeze all pixels into one-pixel information by a dot product instead of arbitrary average pooling.
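To make the relation between SPA and GAP concrete, the following NumPy sketch flattens the spatial dimensions and reduces them with a learnable weight vector. It is a simplified reading of the description, assuming one weight vector of length $L = W \times H$ shared across samples and channels; the function names are hypothetical. With uniform weights $1/L$ the operation reduces exactly to GAP, which is why SPA can be seen as a learnable generalization of it.

```python
import numpy as np

def spatial_pixel_attention(x, weight):
    """Hypothetical SPA sketch. x: (N, C, W, H); weight: (L,), L = W * H. Returns (N, C, 1)."""
    n, c, w, h = x.shape
    x_r = x.reshape(n, c, w * h)              # (N, C, L)
    return x_r @ weight.reshape(w * h, 1)     # weighted sum over all pixels

def global_average_pooling(x):
    """Plain GAP with the same output layout, for comparison."""
    return x.mean(axis=(2, 3))[..., None]     # (N, C, 1)

x = np.random.default_rng(0).normal(size=(2, 3, 4, 4))
uniform = np.full(16, 1.0 / 16)
print(np.allclose(spatial_pixel_attention(x, uniform), global_average_pooling(x)))
```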

Extreme Connection
Connecting the output layer with each inner layer in a network often generates a smooth landscape [20]. With a smooth landscape, the probability of obtaining a robust result with many merits, such as quick convergence, rises. Building such a connection therefore helps. Here, an Extreme Connection (ExC) is proposed, which considers the information from both SPA and GAP. Figure 6 shows the applied extreme connection, and this operation is given by Equation (4), where $X \in \mathbb{R}^{N \times C}$ is the output of the flatten-concatenate operation $F$, $N$ is the number of data samples, $C$ is the number of channels, and $b$ denotes the $b$th block in the network.
As shown in Figure 6, ExC builds the relationship from the final hidden layer to the output of each block. In addition, SPA evaluates which pixels deserve more attention for the class, to support GAP. Integrating both operations with layer normalization allows the information from both sides to be scaled to the same level for learning.
Figure 5. The process from (a) to (b) reshapes the convolutional image. From (b) to (c) is the SPA process, and its function is similar to global average pooling.
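A rough NumPy sketch of the idea follows. Since the exact form of the operation is not recoverable from the text here, the details are assumptions: each block output is pooled by both SPA and GAP, each pooled vector is layer-normalized over channels, the two are summed, and the per-block results are concatenated along the channel axis. The names `layer_norm` and `extreme_connection` are hypothetical, and the learnable scale and shift of layer normalization are left out.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    """Normalize over the last axis (no learnable scale or shift in this sketch)."""
    mean = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mean) / np.sqrt(var + eps)

def extreme_connection(block_outputs, spa_weights):
    """block_outputs: list of (N, C_b, W_b, H_b); spa_weights: list of (W_b * H_b,)."""
    pooled = []
    for x, w in zip(block_outputs, spa_weights):
        n, c, wd, ht = x.shape
        spa = x.reshape(n, c, wd * ht) @ w    # learnable spatial pooling, (N, C_b)
        gap = x.mean(axis=(2, 3))             # global average pooling, (N, C_b)
        # Assumed combination: normalize both sides to the same scale, then add.
        pooled.append(layer_norm(spa) + layer_norm(gap))
    return np.concatenate(pooled, axis=1)     # (N, sum of C_b)

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(2, 4, 8, 8)), rng.normal(size=(2, 6, 4, 4))]
weights = [rng.normal(size=64), rng.normal(size=16)]
print(extreme_connection(blocks, weights).shape)  # (2, 10)
```

The concatenated vector would then feed the classifier, so every layer module contributes directly to the output.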

UPANets Structure
Figure 6 illustrates the cooperation among the proposed modules. The proposed CPA is applied in each UPA Block. Additionally, ExC connects every UPA Layer Module, with the proposed SPA cooperating with GAP. The detailed changes in size and width, and the proposed attention in UPANets for CIFAR-10, are presented in Table A1.

Experiment Environment Settings
This simulation implemented UPANets and compared it with CNN-based SOTA models. The experimental environment comprises a consumer-grade GPU (RTX Titan with 24 GB) and an eight-core CPU (Intel i9-9900KF) with 32 GB RAM. Because of the limited hardware, we could not evaluate on ImageNet; instead, this simulation compared UPANets and other models on the CIFAR-{10, 100} and Tiny ImageNet datasets. Every training process used a cosine annealing learning schedule with a half cycle. The optimizer was stochastic gradient descent with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 0.0005. A simple combination of data augmentation was applied: random crop with padding = 4, random horizontal flip, and normalization, for both the CIFAR datasets and Tiny ImageNet. As this simulation conducted a series of experiments with different numbers of epochs, the specific number of epochs is stated before each sub-section's experiment description.
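For reference, a half-cycle cosine annealing schedule can be written as a small function. This is a generic sketch of the standard formula rather than the authors' training script; the function name and the minimum learning rate of zero are assumptions, while the initial rate of 0.1 comes from the settings above.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Learning rate after `epoch` epochs of a half-cycle cosine schedule."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine

# The rate decays smoothly from base_lr at epoch 0 to min_lr at the final epoch.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_annealing_lr(epoch, 100), 4))
```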
Apart from recording performance mainly as accuracy (Top-1 Error), we argue that finding a balance between performance and used resources is essential, so efficiency is used to examine the conversion rate throughout the experiments. This consideration reflects the view that blindly chasing higher performance by adding parameters is irrational. The efficiency can be represented as the following equation:
$$E = \frac{Acc}{P} \quad (5)$$
where $E$ represents the efficiency, $P$ is the number of parameters used, and $Acc$ is the accuracy. Equation (5) indicates whether a structure or setting converts parameters into performance efficiently; it is the ratio of accuracy to parameters. For example, if two parameters yield 100% accuracy, the efficiency is $E = 0.5$. If four parameters yield the same 100% accuracy, the efficiency is $E = 0.25$. In these examples, $E = 0.5$ is greater than $E = 0.25$, meaning higher efficiency.
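The two worked examples can be reproduced directly. The sketch below assumes accuracy is expressed as a fraction (100% = 1.0), which is what the stated values of 0.5 and 0.25 imply; the function name `efficiency` is only illustrative.

```python
def efficiency(accuracy, n_params):
    """Ratio of accuracy (as a fraction in [0, 1]) to the number of parameters."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return accuracy / n_params

print(efficiency(1.0, 2))  # 0.5
print(efficiency(1.0, 4))  # 0.25
```

In practice the parameter count is in the millions, so the unit chosen for it only rescales the ratio and does not change the ranking between models.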

Ablation Study
In this sub-section, we conducted a series of ablation comparisons on the different components of UPANets. The following comparisons report the performance of UPANets with F = 16 on CIFAR-{10, 100}, where "F" is as shown in Table A1, and each result is the highest accuracy recorded in the testing stage. The total number of epochs in this sub-section was set to 100, and the experiment settings followed the description in Section 4.1.

Global Fusion from Channel Pixel Attention
Following Section 3.1, CPA is expected to help CNNs consider the global information of images as ViT [16] does, but CPA achieves this with only a one-layer perceptron. With this one-layer perceptron, CPA requires only one-third of the parameters of the attention in ViT, which processes a Query, Key, and Value through three one-layer perceptrons every time.
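The one-third figure follows from counting projection weights. The sketch below assumes square $C \times C$ projections and ignores biases and any output projection, which the text does not mention; both helper names are hypothetical.

```python
def cpa_weight_count(channels):
    """Weights in a single channel-to-channel perceptron."""
    return channels * channels

def qkv_weight_count(channels):
    """Weights in separate Query, Key, and Value projections of the same size."""
    return 3 * channels * channels

c = 64
print(cpa_weight_count(c), qkv_weight_count(c))  # 4096 12288
```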
To illustrate the global information learned by CPA, Figure 7 samples the first 32 feature maps from the CNNs in UPA Block 0 before the CPA in UPA Layer 2. Figure 7 contains three rectangles in green, orange, and red. It is evident that the green region from the CNNs only detected the specific pattern of each kernel, and some kernels only detected background information. Moreover, a feature map remains dim if the kernel cannot detect a feature. Most importantly, despite the concatenation in the UPA Block and the operation in the UPA layer module, that is, although residual and concatenation learning are involved, the CNNs still only detected specific patterns. A typical way to prevent dull outputs is adding more width to increase the pattern variety in CNNs, but this comes at the cost of more parameters. The above discussion explains the saturation of ResNets and DenseNets, despite residual and concatenation learning.
Conversely, with the help of CPA, the orange area is immune to this issue in CNNs. Additionally, with UPA Blocks and UPA Layer Modules, the first 16 feature maps in the green region are from the root CNNs (CNNs in UPA Layer Module 0), the 17th to 20th feature maps are from the root and UPA Block 0, and so on. As CPA sees feature maps from the root CNNs to UPA Block 4, the outputs from CPA become gradually more complex. This shows the capability of learning cross-channel global information block by block and helps to expand the receptive field directly.
Therefore, every feature map from CPA covers the learned information from itself to the others, so each pixel considers pixels located at the same position as others by learnable weights. Namely, the CPA can determine which pixel is helpful for consideration. Lastly, the samples of Conv + CPA possess the detected local patterns from the CNNs and conclude the global features from others. A sample of learned patterns in CNN and CPA by inputting noise can be seen in Appendix C.
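To make the per-pixel channel mixing concrete, the following is a minimal NumPy sketch. It assumes that CPA can be approximated by a single fully connected layer applied across the channel axis at every spatial position. The function name `channel_pixel_attention`, the random weights, the tensor sizes, and the omission of any normalization or residual sum are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def channel_pixel_attention(x, weight, bias):
    """Mix channels at each pixel with one fully connected layer (assumed form).

    x:      feature maps, shape (N, C_in, H, W)
    weight: shape (C_out, C_in), shared by every pixel position
    bias:   shape (C_out,)
    """
    # out[n, o, h, w] = sum_c weight[o, c] * x[n, c, h, w] + bias[o]
    out = np.einsum("oc,nchw->nohw", weight, x)
    return out + bias[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 32, 8, 8))        # hypothetical batch with 32 feature maps
weight = rng.normal(size=(32, 32)) * 0.1  # random stand-in for learned weights
bias = np.zeros(32)

y = channel_pixel_attention(x, weight, bias)
print(y.shape)  # (2, 32, 8, 8)
```

Under this assumption, every output value at position (h, w) is a learned weighted sum of all input channels at that same position, which is how a feature map that stayed dim after the convolution can still receive information from the other maps.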

Comparing with ShuffleNet
Learning global information can be understood as sharing learned information with others. Under this notion, as in the earlier discussion of ShuffleNets, the shuffle operation is close to this idea. By shuffling the order of independently learned feature maps, the subsequent grouped CNNs have the chance to map patterns from different groups. Groups in CNNs divide the channels (filters) into several independent groups for detection (e.g., channels = 16 and groups = 2 means that the channels are separated into two groups of 8 channels, where the groups do not share learned parameters). Therefore, Table 1 compares CPA with the shuffle operation in ShuffleNets under two and four CNN groups, respectively. The bold font indicates the best performance for each indicator.
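The grouping and shuffling described above can be sketched as follows. This is a minimal NumPy illustration of the commonly used reshape-transpose-reshape form of channel shuffle, together with the weight count of a grouped convolution. The helper names `channel_shuffle` and `grouped_conv_weights` and the 3 × 3 kernel size are assumptions made for the example.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels so that each group receives channels from every group."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channels must be divisible by groups")
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap the group axis and the within-group axis
    return x.reshape(n, c, h, w)

def grouped_conv_weights(c_in, c_out, kernel, groups):
    """Number of weights in a grouped convolution (bias excluded)."""
    return (c_in // groups) * c_out * kernel * kernel

# Label each of the 16 channels with its own index to trace the shuffle.
x = np.arange(16, dtype=float).reshape(1, 16, 1, 1)
order = channel_shuffle(x, groups=2)[0, :, 0, 0].astype(int).tolist()
print(order)  # [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

print(grouped_conv_weights(16, 16, 3, groups=1))  # 2304
print(grouped_conv_weights(16, 16, 3, groups=2))  # 1152
```

With channels = 16 and groups = 2, the first group originally holds channels 0 to 7 and the second holds channels 8 to 15. After the shuffle, each group of 8 contains four channels from each original group, so the next grouped convolution can combine patterns across groups without any additional parameters.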

Building Connection with Learnable Pooling
As discussed in Section 3.5 on ExC, one of the reasons for introducing the connection is to create a smooth loss landscape, which raises the potential for a robust result. To verify this idea, the best approach is to plot the landscapes of the loss and the Top-1 Error. Therefore, a series of landscape visualizations of models with and without ExC on CIFAR-10 follows. Additionally, we argued that arbitrarily applying GAP to spatial information risks losing important information. As a result, along with the visualizations, the proposed SPA is included in this simulation, with performance evaluations on CIFAR-{10, 100} afterwards.
To visualize the landscapes on a common scale, we applied a min-max scaler to scale the loss of each competitor into [0:1], so that the landforms can be compared from the same standpoint. For Top-1 Error, the values are already in [0:1] as a percentage, so the scaling is skipped. Please see the landscapes of the scaled loss in Figures 8 and 9 and of the Top-1 Error in Figures 10 and 11.
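The min-max scaling step can be sketched as below. This is a minimal illustration assuming each landscape is stored as a 2D grid of loss values; the function name `min_max_scale` and the two small grids are hypothetical and are not results from the paper.

```python
import numpy as np

def min_max_scale(grid):
    """Scale a grid of loss values into [0, 1]."""
    grid = np.asarray(grid, dtype=float)
    lo, hi = grid.min(), grid.max()
    if hi == lo:
        # A constant grid has no relief to show; map it to zeros.
        return np.zeros_like(grid)
    return (grid - lo) / (hi - lo)

# Two made-up landscapes with very different raw loss ranges.
loss_a = np.array([[0.2, 0.5], [1.0, 2.2]])
loss_b = np.array([[0.1, 30.0], [4.0, 12.0]])

scaled_a = min_max_scale(loss_a)
scaled_b = min_max_scale(loss_b)
print(scaled_a.min(), scaled_a.max())  # 0.0 1.0
print(scaled_b.min(), scaled_b.max())  # 0.0 1.0
```

After scaling, both grids span the same interval, so their shapes can be compared on one axis even though the raw losses differ by an order of magnitude.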
Regarding the benefits of SPA, we have seen that it helps smooth the landscape. Another observable benefit, shown in Table 2, is a performance boost. Comparing Final SPA with Final GAP, the performance increased on both CIFAR-{10, 100}, but the advantage is reversed when combined with ExC. Although working with ExC does not bring an improvement in every case, an improvement did occur when ExC and GAP worked together. A more significant improvement, shown in bold, is seen when ExC, SPA, and GAP are used together. Given the most significant improvements in performance and landscapes, we opt for ExC + SPA + GAP with the proposed methods to form UPANets.
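The contrast between GAP and a learnable pooling can be illustrated with a short NumPy sketch. It assumes that the learnable pooling is a single weight vector over the flattened spatial positions, shared by all channels. The function names, the 3 × 3 feature map size, and the hand-picked weights are illustrative assumptions; the exact SPA formulation is defined in the method section and may differ.

```python
import numpy as np

def global_average_pool(x):
    """GAP: every spatial position contributes equally. x has shape (N, C, H, W)."""
    return x.mean(axis=(2, 3))

def learnable_spatial_pool(x, weight, bias=0.0):
    """Weighted sum over spatial positions; weight has shape (H * W,)."""
    n, c, h, w = x.shape
    flat = x.reshape(n, c, h * w)
    return flat @ weight + bias  # shape (N, C)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))

# With uniform weights 1 / (H * W), the learnable pooling reduces to GAP.
uniform = np.full(9, 1.0 / 9.0)
print(np.allclose(learnable_spatial_pool(x, uniform), global_average_pool(x)))  # True

# A hand-picked non-uniform weight that emphasizes the centre pixel.
centre_heavy = np.full(9, 0.05)
centre_heavy[4] = 0.6
print(learnable_spatial_pool(x, centre_heavy).shape)  # (2, 4)
```

In this view, GAP is the special case in which all positions are weighted equally, whereas trained weights can keep the positions that matter and suppress the rest.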
Entropy 2022, 24, x FOR PEER REVIEW

Comparison with SOTAs
After evaluating the proposed components, which together form UPANets, it is vital to compare the full model with existing SOTAs. In UPANets, setting F = 16, 32, and 64 as the channel number base represents different widths. Comparing these variants with existing CNN-based models on CIFAR-{10, 100}, as in the former simulations, gives a much clearer picture of their place among SOTAs. Additionally, because we were unable to evaluate on ImageNet, Tiny ImageNet is chosen as an alternative. In the following comparison, the models are reimplemented based on the work in the link (https://github.com/kuangliu/pytorch-cifar accessed on 23 October 2020), following the experiment setting in Section 4.1 except for setting the number of epochs to 200.

Comparison in CIFARs
In this comparison, the performance of each model was recorded as accuracy on the testing data, parameter size in millions, and efficiency as defined in Equation (5), with the best performance in bold in the tables. As there are three performance indexes in Table 3, the information is also presented as a scatter plot in Figure 12, with accuracy on the y-axis and efficiency on the x-axis. The size of the circle for each model represents its parameter size in millions. The same policies apply to Table 4 and Figure 13.
In the CIFAR-10 comparison, UPANet64 has the best accuracy. As plotted in Figure 12, UPANets show an outstanding balance between efficiency and accuracy. In addition, the models claimed to have a lite structure are located in the bottom right area, but they lose some accuracy. On the other side, UPANet16 and DenseNet are located in the upper right corner, indicating that the proposed model and DenseNets have high efficiency. As for the accuracy in Table 3, UPANet64 is the only model reaching over 96% accuracy, and it does so without many parameters, especially compared with ResNet101 and DenseNet201. A similar overall distribution of the three indexes is seen in the CIFAR-100 comparison. Although UPANet16 and UPANet32 fall behind in terms of efficiency, UPANet64 is the one that passes 80% accuracy on CIFAR-100. As a result, UPANets performed well on both open datasets across the evaluated indexes.

Comparison in Tiny ImageNet
Although we compared a series of SOTAs with UPANets on CIFAR-{10, 100}, these datasets are less difficult than Tiny ImageNet, which requires classifying about twice as many labels as CIFAR-100. Moreover, its image size is two times larger than that of CIFARs, so we only examined UPANet64 for 100 epochs with the same experimental setting as the above comparisons. Further, some SOTAs that were also examined on Tiny ImageNet are shown together in Table 5. As a whole, UPANets performed excellently not only on widely used datasets but also on the more complex Tiny ImageNet. Moreover, based on classification performance, the proposed UPANets can be one of the state-of-the-art models on the Tiny ImageNet benchmark (checked in April 2021).

Conclusions
This work proposed a novel backbone, UPANets, for image classification. Each proposed component in the framework fulfils specific objectives and helps the model outshine existing SOTAs in terms of performance and efficiency. The positive findings and potential contributions can be summarized as follows.

CPA in Processing Global Information with Benefits
First, CPA captures global information across channels to form more complex feature maps, expanding the receptive fields of shallow layers. That is, the shallow layers mature quickly, which boosts performance. In turn, more mature layers reduce the need for stacking deep. With the concatenation in UPA blocks and the accumulation of UPA layer modules, the effect is further amplified.

SPA with ExC Brings Better Environments for Learning
Connecting each layer and transporting essential spatial information through learnable attention brings smoother landscapes. Regarding the concern of losing information by arbitrarily averaging out spatial pixels, SPA ameliorates it and improves performance. Moreover, with ExC, passing feedback from SPA to each layer forms a smooth landform.

UPANets Perform Competitively against SOTAs
Finally, compared with a series of SOTAs on CIFAR-{10, 100} and Tiny ImageNet, the results of UPANets are better than most existing SOTAs. As a result, we are convinced that UPANets can perform competitively in image classification. Further, this practical evidence shows that learning universal pixels channel-wise and spatial-wise with the proposed modules can utilize parameters effectively.
In sum, these attempts create a way to develop an efficient backbone for effectively processing universal information with decent performance.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Dimension Illustration in UPANets Structure
In Table A1, which shows the UPANets structure for CIFAR-10, N represents the number of data samples, F indicates the number of filters, B_i are blocks, d means the depth multiplier, b is the number of the block, and w is the convolutional width. UPA Block 0 and the other blocks follow the stride 2 version and the stride 1 version of the UPA block, respectively.

Appendix B. Comparison of Perceptron and CNNs in Attention
In Section 3.1, we introduced the cross Channels Pixel Attention (CPA) mechanism, in which a one-layer perceptron is applied. A 1 × 1 CNN is also a standard option for mapping information across channels. However, as the simulation in Table A2 shows, the performance of CNNs falls behind that of the one-layer perceptron, whose results are in bold. The underlying reason could be that although CNNs can share patterns, the single parameter in each shared pattern is limited in carrying vital information. Additionally, as the one-layer perceptron operates by dot product, the information is shared and combined across channels, supporting our point that CPA detects cross-channel information at the same pixel position.

Appendix C. Sample Pattern of the CNN and CPA in UPA Block
Because CPA operates within UPA blocks across UPA layer modules, it can learn cross-channel and cross-block pixel information to form universal attention. This is enabled by the concatenation in UPA blocks and the accumulation of UPA layer modules. To observe the learned patterns over the global range, random noise was input to extract the pattern profiles in Figure A1, with the same extraction policy as Figure 7.

Figure A1. Samples of Fusion Feature Maps in UPANets with Noise.

Appendix D. Landscape toward UPANets and Others
The loss landscape visualization method introduced in [20] helps researchers understand the possible training landscape among the parameters of a model. According to the implementation source code (https://github.com/tomgoldstein/losslandscape, https://github.com/JoelNiklaus/loss_landscape accessed on 20 April 2021), the primary usage is to set a random sampling range in [−1:1] with a specific number of samples (default: 50). However, as this sampling method is similar to sensitivity analysis for determining feature importance, only a good sampling range can produce a calculable loss. This dilemma impeded us when we tried to visualize a sensitive model, such as DenseNets, because adding a little noise might cause the loss to become NaN. Therefore, defining a good sampling range is a challenge. On the other hand, although filter normalization has been introduced [20] to compare loss landscapes from different models, we found that different loss ranges still make comparison hard. An enormous total loss range will make most landscapes look smoother because an outlier will break the harmony of the loss map.
To address these dilemmas, we introduced an automatic range search and min-max scaling into our visualization. First, a workable visualization range is found in advance by binary search, starting from the original range of [−1:1]. Then, we applied min-max scaling to every loss landscape to make the landscapes comparable. Finally, for demonstration, we trained DenseNets and our models end-to-end on CIFAR-10 based on the code in this project (https://github.com/kuangliu/pytorch-cifar accessed on 23 October 2020) and applied the introduced methods to produce the following pre- and post-scaled landscapes.
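The automatic range search can be sketched as a binary search over the half-width of the sampling range. The version below is a minimal pure-Python illustration that assumes the loss is finite inside some symmetric range and becomes NaN or infinite outside it. The names `is_visualizable`, `search_range`, and `toy_loss`, the probe count, and the tolerance are hypothetical. In practice, the probed function would evaluate the model with its parameters perturbed along the sampled directions, which is not reproduced here.

```python
import math

def is_visualizable(loss_at, half_width, probes=5):
    """Check that the loss is finite at evenly spaced offsets in [-half_width, half_width]."""
    offsets = [-half_width + 2 * half_width * i / (probes - 1) for i in range(probes)]
    return all(math.isfinite(loss_at(o)) for o in offsets)

def search_range(loss_at, upper=1.0, tol=1e-3):
    """Binary search for a half-width whose sampled losses are all finite."""
    if is_visualizable(loss_at, upper):
        return upper           # the default range already works
    lo, hi = 0.0, upper        # lo is known to work, hi is known to fail
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_visualizable(loss_at, mid):
            lo = mid
        else:
            hi = mid
    return lo

def toy_loss(offset):
    """Stand-in for a sensitive model: the loss turns NaN beyond |offset| = 0.04."""
    return float("nan") if abs(offset) > 0.04 else offset ** 2 + 0.1

half_width = search_range(toy_loss)
print(half_width)  # 0.0390625
```

The returned half-width defines the range [−half_width : half_width] on which the landscape is sampled; the resulting grid can then be min-max scaled before plotting.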

Appendix D.1. Comparison with DenseNet
The visualizable sampling range was [−0.0375 : 0.0375] with 50 samples. In Figure A2, the largest loss broke the harmony of the original loss landscape. The normal losses are the majority, but it is hard to see the fluctuation of the landscape among the relatively minor losses. As a result, the flattened space created an illusion. The min-max scaled loss landscape shows a much different view. Although the centre of the map is still flat, the surrounding loss stands erect at the edge. Not only can the scaled landscape reveal a much more reasonable profile, but scaling can also make different landscapes comparable. Therefore, the same search and scaling policies were applied to UPANet16 in [−0.0375 : 0.0375] to compare with DenseNet in Figure A3 and the Top-1 Error ones in Figure A4.
Comparing Figure A2 to Figure A4 in the range [−0.0375 : 0.0375], UPANet shows a similar view to DenseNet, but there are fewer enormous losses at the edge of the landscape. In particular, it seems that models can quickly reach a minimum in the UPANet16 Top-1 Error map, with a lower gap in the margin.

Appendix D.2. UPANet16 Variants Original Landscapes
In our visualizations of UPANet16 and its variants, the default range of [−1:1] already yields a visualization, which indicates that UPANets are not as sensitive as DenseNets. This also implies the robustness of UPANets toward noise, as the method of [20] samples parameters from different angles, which is like adding noise to see how the loss changes. Thus, the sensitive changes formed the landforms we obtained. The scaled ones have been shown in Section 4.2.3. Figures A5 to A6 show the pre-scaled landscapes for each variant of UPANet16.