FSRFNet: Feature-selective and Spatial Receptive Fields Networks

Abstract: The attention mechanism plays a crucial role in the human visual experience. In the cognitive neuroscience community, the receptive field size of visual cortical neurons is regulated by the additive effect of feature-selective and spatial attention. We propose a novel architectural unit called a "Feature-selective and Spatial Receptive Fields" (FSRF) block that implements adaptive receptive field sizes of neurons through the additive effects of feature-selective and spatial attention. We show that FSRF blocks can be inserted into the architecture of existing convolutional neural networks to form an FSRF network architecture, and test its generalization capabilities on different datasets.

In addition to deep learning and object detection methods, previous research has studied the importance of attention [23][24][25]. We focus on the interaction between feature-selective and spatial attention and its impact on receptive fields (RFs). The classical RF (CRF) of neurons in the V1 region was discussed in [26]. Researchers have proposed two models of how attention affects the discharge rates of neurons in visual areas such as V4 [27]. One is the input gating model, in which the RF mediates which stimuli in the visual field are attended to or ignored. The other is the neuronal strobing model, which states that neurons in the V4

Attention Mechanisms
Evidence from the human perception process [23] illustrates the importance of the attention mechanism. Attention is generally divided into two types: top-down conscious attention and bottom-up unconscious attention. It is worth noting that the mechanism uses top-down information to guide the bottom-up feed-forward process, which biases processing toward the most informative feature expressions while suppressing less useful ones [41][42][43]. Attention is the ability to focus perception on a stimulus while ignoring other potential stimuli. It has proven to be one of the most interesting areas of research in cognitive neuroscience. The benefits of the attention mechanism have been demonstrated in many tasks, including neural machine translation [44], image captioning [45], and lip reading [46]. In order to enhance the performance of image classification, some researchers have introduced attention mechanisms into their network models [33,47,48]. In Brad Motter's neurophysiological study of attention in macaque monkeys [49], all regions of the visual cortex showed a greater sensitivity to attention when the target stimulus was presented with competing stimuli, compared to the condition when the target stimulus was presented alone. Another study of macaques similarly found that attention regulated sensory responses in areas V2 and V4 primarily when two or more simultaneous stimuli competed for access to a neuron's receptive field; no such effect was observed when only a single stimulus was inside the receptive field [50]. An fMRI study found that selective attention in humans modulates neural activity in the visual cortex through top-down biasing signals [48]. When a monkey or a person is asked to pay attention to a particular area of space, corresponding increases in neural activity can be observed in V1, V2, and V4 [49][50][51]. Such activation enhancement is described as a "gain" of the sensory process [52] or as an increase in neuronal sensitivity. Studies [29][30][31][32] show that there is a small but significant interaction between feature-selective and spatial attention. Recent research [53] suggests that feature-based attention operates in a spatially global manner throughout the field of view, and that the spatial focus of attention should be carefully controlled to maximize the global feature-based attention effect.

Feature-selective and Spatial Receptive Fields (FSRF) Blocks
In order to implement adaptive receptive field (RF) sizes of neurons through the additive effects of feature-selective and spatial attention, we propose a novel architectural unit called a "Feature-selective and Spatial Receptive Fields" (FSRF) block. The structure of the FSRF block is depicted in Figure 1. We implement the FSRF block via three operations: Multi-branch Convolution, Fuse, and Interactions between Feature-selective and Spatial Attention.

Figure 1. Structure of the FSRF block, showing the AMC attention building block, the AMS attention building block, and the interactions between feature-selective and spatial attention.

Multi-Branch Convolution
The goal of using multi-branch convolution is to provide various filters for multiple branches, ultimately aggregating more informative and multifarious features. For any given feature map X ∈ R^{H×W×C}, we first conduct two transformations F̃_tr: X → Ũ ∈ R^{H×W×C} and F̂_tr: X → Û ∈ R^{H×W×C}. Note that F̃_tr and F̂_tr are composed of efficient convolutions, with batch normalization and ReLU applied in sequence. A 3 × 3 convolution kernel is used in the transformation F̃_tr, and a 5 × 5 convolution kernel is used in the transformation F̂_tr. H, W, and C denote the height, width, and number of channels of the feature map, respectively. Let W = [w_1, w_2, ..., w_C] and O = [o_1, o_2, ..., o_C] denote the learned sets of 3 × 3 and 5 × 5 convolution kernels, where w_c and o_c refer to the parameters of the corresponding c-th convolution kernel. The outputs of F̃_tr and F̂_tr can then be written as

ũ_c = w_c * X,  û_c = o_c * X,

where * denotes convolution.
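As a concrete reference, the following is a minimal PyTorch sketch of the two-branch transformation described above. The class name MultiBranchConv and the use of grouped convolutions with cardinality G (as listed later in Table 1) are our own illustrative choices rather than details specified in this section.

```python
import torch
import torch.nn as nn

class MultiBranchConv(nn.Module):
    """Sketch of the two branch transformations: a 3x3 and a 5x5
    convolution, each followed by batch normalization and ReLU.
    `groups` (the cardinality G) is assumed to divide `channels`."""
    def __init__(self, channels, groups=32):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=5, padding=2,
                      groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # U~ from the 3x3 kernels, U^ from the 5x5 kernels
        return self.branch3(x), self.branch5(x)
```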

Fuse
Our goal is to enable neurons to adaptively adjust their RF sizes through the additive effects of feature-selective and spatial attention. The basic idea is to use two gates from the average and max channel (AMC) and average and max spatial (AMS) attention building blocks to control the flow of multiple branches carrying different scales of information into neurons in the next layer. We first combine the results of multiple branches (such as the two shown in Figure 1) by element-wise summation:

U = Ũ + Û.

We then input the feature map obtained from the previous step into the AMC and AMS attention building blocks, and flexibly select different information-space scales under the guidance of compact feature descriptors.
The structure of the AMC attention building block is depicted in Figure 2. The AMC attention building block generates a channel attention map by using the inter-channel relationship of features. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map into a channel descriptor by using global average pooling and global max pooling. We describe the detailed operation below. The preprocessed feature map U passes through two branches of the AMC attention building block. The first branch uses global average pooling to generate channel-wise statistics: a statistic z ∈ R^C is generated by shrinking U through its spatial dimensions H × W, such that the c-th element of z is calculated by:

z_c = F_ga(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j),
where F_ga(u_c) indicates the global average pooling operator. Further, in order to take advantage of the information aggregated by global average pooling, we conduct a second operation, the purpose of which is to make full use of the dependencies between different feature maps. To achieve this, we use a dimensionality-reduction layer with parameters T_1 and reduction ratio r, a ReLU layer, and a dimensionality-increasing layer with parameters T_2; fully connected layers are used for both the dimensionality-reduction and dimensionality-increasing layers. The average attention of the channel is computed as

s_a = T_2 δ(T_1 z),

where δ refers to the ReLU function, T_1 ∈ R^{(C/r)×C}, and T_2 ∈ R^{C×(C/r)}. The second branch uses global max pooling to generate channel-wise statistics: a statistic z̃ ∈ R^C is generated by shrinking U through its spatial dimensions H × W, such that the c-th element of z̃ is calculated by

z̃_c = F_gm(u_c) = max_{i,j} u_c(i, j),

where F_gm(u_c) indicates the global max pooling operator.
Additionally, we conduct a second operation in order to take advantage of the information aggregated by global max pooling, the purpose of which, as with the first branch, is to make full use of the dependencies between different feature maps. The maximum attention of the channel is computed as

s_m = T_2 δ(T_1 z̃).

Finally, the feature information of the two branch outputs is combined by

s = F_add(s_a, s_m),

where s ∈ R^C and F_add(s_a, s_m) indicates the channel-wise summation of s_a and s_m.
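A minimal PyTorch sketch of the AMC attention building block described above follows. The class name AMCAttention is ours, and sharing the two fully connected layers (T_1, T_2) between the average pooling and max pooling branches is an assumption, since the text does not state whether these parameters are shared.

```python
import torch
import torch.nn as nn

class AMCAttention(nn.Module):
    """Average-and-max channel (AMC) attention, a sketch: global average
    pooling and global max pooling each feed a two-layer fully connected
    bottleneck (reduction ratio r), and the two channel descriptors are
    combined by channel-wise summation."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r, bias=False),  # T1: reduce
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=False),  # T2: increase
        )

    def forward(self, u):
        b, c, _, _ = u.size()
        z_avg = self.avg_pool(u).view(b, c)   # z_c  = F_ga(u_c)
        z_max = self.max_pool(u).view(b, c)   # z~_c = F_gm(u_c)
        s = self.fc(z_avg) + self.fc(z_max)   # F_add: channel-wise summation
        return s                              # s in R^C
```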
The structure of the AMS attention building block is depicted in Figure 3. The role of the AMS attention building block is to produce a spatial attention map by exploiting the inter-spatial relationship of features. We first apply global average pooling and global max pooling operations along the channel axis to generate two efficient feature descriptors, and then concatenate the two resulting feature maps. Based on the concatenated feature descriptor, we use a convolution layer to generate a spatial attention map. In order to use the spatial attention map in a gated operation, we apply a convolution layer with C channels and a global average pooling operation after the last convolution layer. We describe the detailed operation below. For the preprocessed feature map U, we first conduct two transformations H_mean: U → V_a ∈ R^{H×W×1} and H_max: U → V_m ∈ R^{H×W×1}. These are concatenated to create the spatial attention map M' ∈ R^{H×W×2}:

M' = cat(H_mean(U), H_max(U)).

The map is then convolved with a 7 × 7 filter to produce a 2D spatial attention map M'' ∈ R^{H×W}:

M'' = F_7×7(M').

The 2D spatial attention map generated in the previous step is then expanded by a 1 × 1 convolution, resulting in a multidimensional spatial attention map M ∈ R^{H×W×C}:

M = F_1×1(M''),

where F_1×1 represents a convolution operation with a filter size of 1 × 1, F_7×7 represents a convolution operation with a filter size of 7 × 7, and 'cat' denotes the concatenation function. H_mean(U) and H_max(U) refer to global average pooling and global max pooling operations along the channel axis. We then use global average pooling to generate channel-wise statistics: a statistic n ∈ R^C is generated by shrinking M through its spatial dimensions H × W, such that the c-th element of n is calculated by

n_c = F_ga(m_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} m_c(i, j).
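The AMS attention building block can be sketched in the same style. The class name AMSAttention is ours; padding sizes are chosen so that the spatial dimensions are preserved.

```python
import torch
import torch.nn as nn

class AMSAttention(nn.Module):
    """Average-and-max spatial (AMS) attention, a sketch: channel-wise
    average and max maps are concatenated, convolved with a 7x7 filter,
    expanded to C channels by a 1x1 convolution, and reduced to a channel
    descriptor by global average pooling."""
    def __init__(self, channels):
        super().__init__()
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.conv1 = nn.Conv2d(1, channels, kernel_size=1, bias=False)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, u):
        b, c, _, _ = u.size()
        v_mean = torch.mean(u, dim=1, keepdim=True)    # H_mean(U)
        v_max, _ = torch.max(u, dim=1, keepdim=True)   # H_max(U)
        m = torch.cat([v_mean, v_max], dim=1)          # M'  in R^{HxWx2}
        m = self.conv7(m)                              # M'' in R^{HxW}
        m = self.conv1(m)                              # M   in R^{HxWxC}
        n = self.gap(m).view(b, c)                     # n   in R^C
        return n
```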

Interactions between Feature-selective and Spatial Attention
In order to take advantage of the information aggregated in the AMC and AMS attention building blocks, we conduct a soft attention across channels that achieves interactions between feature-selective and spatial attention. Firstly, a SoftMax operator is applied on the channel-wise digits at the output of the AMC building block, producing the vectors a and b; similarly, a SoftMax operator is applied on the channel-wise digits at the output of the AMS building block, producing the vectors c and d. Here A, B, J, K ∈ R^{C×(C/r)} are the corresponding weight matrices; a and c denote the vectors for Ũ, and b and d denote the vectors for Û. Note that a_c is the c-th element of a and A_c ∈ R^{1×(C/r)} is the c-th row of A; likewise for b_c, B_c, c_c, J_c, d_c, and K_c. In addition, a simple sigmoid operator is applied on the channel-wise digits at the output of the AMC building block, producing e, and on the output of the AMS building block, producing f. The feature maps Ỹ, Y', and Ŷ are then obtained by rescaling the transformation outputs Ũ, U, and Û with these activations:

ỹ_c = F_mul(ũ_c, a_c, c_c),  y'_c = F_mul(u_c, e_c, f_c),  ŷ_c = F_mul(û_c, b_c, d_c),

where Ỹ = [ỹ_1, ỹ_2, ..., ỹ_C] with ỹ_c ∈ R^{H×W}, and likewise for Y' and Ŷ. F_mul(ũ_c, a_c, c_c) refers to channel-wise multiplication between the scalars a_c, c_c and the feature map ũ_c, and likewise for F_mul(u_c, e_c, f_c) and F_mul(û_c, b_c, d_c). The final feature map Y is obtained by the element-wise summation of Ỹ, Y', and Ŷ:

Y = Ỹ + Y' + Ŷ.
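The following sketch ties the pieces together into a single module, reusing the classes defined above. It is a deliberately simplified illustration of the gating idea: instead of the full parameterization with the matrices A, B, J, K and the separate sigmoid gates, it applies a single softmax across the two branch descriptors and rescales the branch outputs, so it should be read as an approximation of the FSRF block rather than its exact formulation.

```python
import torch
import torch.nn as nn

class FSRFBlock(nn.Module):
    """Simplified sketch of an FSRF block: multi-branch convolution,
    fuse by element-wise summation, and a soft selection across the
    two branches driven by the AMC and AMS descriptors."""
    def __init__(self, channels, r=16, groups=32):
        super().__init__()
        self.branches = MultiBranchConv(channels, groups)
        self.amc = AMCAttention(channels, r)
        self.ams = AMSAttention(channels)

    def forward(self, x):
        u3, u5 = self.branches(x)            # U~ and U^
        u = u3 + u5                          # fuse: U = U~ + U^
        s = self.amc(u)                      # channel descriptor from AMC
        n = self.ams(u)                      # channel descriptor from AMS
        # Soft attention across the two branches (an assumed SK-style
        # simplification of the interaction step described above).
        a = torch.softmax(torch.stack([s, n], dim=1), dim=1)  # B x 2 x C
        a = a.unsqueeze(-1).unsqueeze(-1)                      # B x 2 x C x 1 x 1
        return a[:, 0] * u3 + a[:, 1] * u5   # rescale and sum the branches
```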

Instantiation
The FSRF block can be integrated into standard architectures such as ResNet [6] by inserting it after the non-linearity following each convolution. In addition, the flexibility of the FSRF block means that it can be applied directly to transformations beyond standard convolution.
Here, FSRF blocks are used with residual modules. By making this change to each such module in the architecture, we can obtain an FSRF-ResNet network. Figure 4 depicts the schema of an FSRF-ResNet module. Further variants that integrate FSRF blocks with ResNeXt [9], ShuffleNetV2 [38], and MobileNetV2 [39] can be constructed by following similar schemes, as discussed below.
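A hedged sketch of such a residual unit is shown below, assuming the FSRF block sits between the two 1 × 1 convolutions of a bottleneck and reusing the FSRFBlock sketch above; the expansion factor and the omission of stride/downsampling are illustrative simplifications rather than details from Figure 4.

```python
import torch.nn as nn

class FSRFBottleneck(nn.Module):
    """Sketch of an FSRF residual unit: 1x1 reduce -> FSRF block ->
    1x1 expand, wrapped in an identity (or projection) shortcut."""
    expansion = 2  # assumed output width multiplier

    def __init__(self, in_channels, mid_channels, r=16, groups=32):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.fsrf = FSRFBlock(mid_channels, r=r, groups=groups)
        self.expand = nn.Sequential(
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Sequential(
                             nn.Conv2d(in_channels, out_channels, 1, bias=False),
                             nn.BatchNorm2d(out_channels)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.fsrf(self.reduce(x)))
        return self.relu(out + self.shortcut(x))
```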

Network Architecture
An FSRF network (FSRFNet) can be constructed by simply stacking a set of FSRF blocks. As a concrete example, a detailed description of the FSRFNet-50 architecture is presented in Table 1. FSRFNet consists primarily of a stack of repeated bottleneck blocks called "FSRF units," in a similar fashion to ResNeXt [9]. Each FSRF unit consists of a 1 × 1 convolution, an FSRF block, and a further 1 × 1 convolution; that is, the large-kernel convolutions in the original ResNeXt [9] bottleneck blocks are replaced by the proposed FSRF blocks. Table 1 shows the 50-layer FSRFNet-50 architecture, which has four stages using {3, 4, 6, 3} FSRF units. Different architectures can be obtained by changing the number of FSRF units per stage. Two important hyperparameters determine the final setting of the FSRF block: the group number G, which controls the cardinality of each path, and the reduction ratio r, which controls the number of parameters in the fuse operator. In Table 1, we set the reduction ratio r = 16 and the cardinality G = 32.
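Under the same assumptions, the four stages of FSRFNet-50 could be assembled roughly as follows; the stage widths, the 64-channel stem, and the absence of downsampling between stages are illustrative placeholders rather than values read from Table 1.

```python
import torch.nn as nn

def build_fsrf50_stages(widths=(128, 256, 512, 1024), units=(3, 4, 6, 3),
                        r=16, groups=32):
    """Sketch of stacking FSRF units into the four stages of FSRFNet-50,
    reusing the FSRFBottleneck sketch above."""
    stages = []
    in_channels = 64  # assumed stem output width
    for width, n in zip(widths, units):
        blocks = []
        for _ in range(n):
            blocks.append(FSRFBottleneck(in_channels, width, r=r, groups=groups))
            in_channels = width * FSRFBottleneck.expansion
        stages.append(nn.Sequential(*blocks))
    return nn.Sequential(*stages)
```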

Experiments
In this section, we conduct experiments to study the effectiveness of the FSRF block across a range of tasks, datasets, and model architectures. For benchmarking, we compare single-crop top-1 performance on datasets of different sizes.

Tiny ImageNet Classification
Tiny ImageNet [54] has 200 classes. Each class has 500 training images, 50 validation images, and 50 test images. We train the networks on the training set and report top-1 errors on the validation set. For data augmentation, we follow standard practice and perform random-size cropping to 224 × 224 and random horizontal flipping [2]. We use synchronous SGD with a momentum of 0.9, a mini-batch size of 32, and a weight decay of 1 × 10⁻⁴. The initial learning rate is set to 0.5 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch on one GPU, using the weight initialization strategy in [55]. We first compare FSRFNet-50 and FSRFNet-101 with publicly available competitive models of similar complexity. The results show that the FSRF block consistently improves the performance of state-of-the-art attention-based CNNs.
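For reference, the optimization schedule described above corresponds to roughly the following PyTorch setup; this is a sketch only, and `model` and `train_loader` are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

# Synchronous SGD, momentum 0.9, weight decay 1e-4, initial learning rate 0.5
# decayed by a factor of 10 every 30 epochs, trained for 100 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
```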
We begin by comparing FSRFNets to publicly available competitive models of different depths. Table 2 and Figure 5 show the comparison results on the Tiny ImageNet [54] validation set. As the results show, FSRFNet-50 and FSRFNet-101 improve upon state-of-the-art attention-based network models of similar complexity at different depths. FSRFNet-50 and FSRFNet-101 achieve performance improvements of 5.1% and 5.5% over ResNeXt-50 and ResNeXt-101, respectively. In addition, FSRFNet-50 and FSRFNet-101 achieve performance increases of 2.9% and 3.7% compared to SENet-50 and SENet-101, respectively. We note also that gains of 1.5% and 1.7% are obtained for FSRFNet-50 and FSRFNet-101 compared to SKNet-50 and SKNet-101, respectively. Notably, FSRFNet-50 not only exceeds the absolute accuracy of ResNeXt-101 by 1.27%, but also requires 22% fewer parameters and 27% fewer computations than ResNeXt-101, which demonstrates the superiority of the additive effects of feature-selective and spatial attention.
Table 2. Single 224 × 224 crop top-1 error rates (%) on the Tiny ImageNet validation set and complexity comparisons. SENet, SKNet, and FSRFNet are all based on the corresponding ResNeXt backbones. The definition of FLOPs follows [56], i.e., the number of floating-point multiply-adds, and #P denotes the number of parameters.

Figure 5. Relationship between the performance of FSRFNet and the number of its parameters, compared with the corresponding ResNeXt backbones.
Additionally, we choose the representative compact architectures ShuffleNetV2 [38] and MobileNetV2 [39], which are among the strongest lightweight models, to evaluate the generalization capabilities of FSRF blocks. For comparison, SE, SK, and FSRF blocks are embedded in ShuffleNetV2 [38] and MobileNetV2 [39]. Similar to [56], the number of channels in each block is scaled to generate networks of different complexities, marked as 0.5×, 0.75×, and 1×.

CIFAR Classification
To further evaluate the performance of FSRFNets, we conduct experiments on CIFAR-10 and CIFAR-100 [57]. The CIFAR-10 [57] dataset consists of 60,000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images. The CIFAR-100 [57] dataset resembles the CIFAR-10 [57], except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 [57] are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
We use the same approach as above to integrate FSRF blocks into several popular baseline frameworks (ResNeXt-29 [9], ShuffleNetV2 [38], and MobileNetV2 [39]). Each baseline and its FSRFNet counterpart were trained using standard data augmentation strategies [58,59]. During training, each image is flipped horizontally, padded with four pixels on each side, and then randomly cropped to 32 × 32. We report the performance of each baseline and its FSRFNet counterpart on CIFAR-10 and CIFAR-100 [57] in Tables 5 and 6. The results show that FSRFNets outperform the baseline architectures in every comparison, suggesting that the benefits of FSRF blocks are not confined to the Tiny ImageNet [54] dataset. Remarkably, FSRFNet-29 outperforms ResNeXt-29, 16×64d by more than 0.21% absolute accuracy while nearly halving the number of parameters, which is extremely efficient. For lightweight models, we compare FSRF blocks with the ShuffleNetV2 [38] and MobileNetV2 [39] baseline models equipped with SE blocks and SK blocks. ShuffleNetV2_0.5× + FSRF and ShuffleNetV2_1.0× + FSRF achieve better performance than the other models at the corresponding scales.
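The augmentation pipeline described above corresponds to roughly the following torchvision transforms; this is a sketch, and the normalization statistics are the commonly used CIFAR values rather than figures taken from the paper.

```python
from torchvision import transforms

# Standard CIFAR augmentation: pad each side by 4 pixels, random 32x32 crop,
# random horizontal flip, then tensor conversion and normalization.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])
```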

Visualization with Grad-CAM
To intuitively understand how FSRFNet adapts the RF sizes of neurons through the additive effects of feature-selective and spatial attention, we use the Grad-CAM method [60] to visualize the class activation mapping (CAM) of the SKNet-50 and FSRFNet-50 backbone networks. In the visualization examples shown in Figure 8, lighter-colored areas indicate regions that strongly influence the classification result. SKNet achieves good results in multi-scale information selection. However, since FSRFNet has a stronger ability to adaptively select the appropriate convolution kernel size, its activation maps tend to cover the whole object. Overall, compared with SKNet, our FSRFNet produces better class activation maps.
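For readers who wish to reproduce such visualizations, a minimal Grad-CAM sketch is given below; it follows the general Grad-CAM recipe [60] (gradient-weighted feature maps, ReLU, upsampling) rather than any FSRFNet-specific code, and `model`, `image`, and `target_layer` are assumed to be supplied by the caller.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Compute a Grad-CAM heatmap for `class_idx` from `target_layer`.
    `image` is a 1xCxHxW tensor; the returned map is normalized to [0, 1]."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.append(go[0]))

    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    weights = gradients[0].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = F.relu((weights * activations[0]).sum(dim=1))     # weighted sum over channels
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[2:],
                        mode='bilinear', align_corners=False)
    return cam / (cam.max() + 1e-8)
```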

Figure 8. Grad-CAM visualization examples: bird, bus, and airplane.

Conclusions
In this paper, inspired by the additive effect of feature-selective and spatial attention on the receptive field sizes of visual cortex neurons, we constructed the Feature-selective and Spatial Receptive Fields (FSRF) block and inserted it into existing convolutional architectures to form the FSRFNet architecture. The FSRF block is implemented via three operations: Multi-branch Convolution, Fuse, and Interactions between Feature-selective and Spatial Attention. Fuse combines the results of multiple branches with different kernel sizes and constructs the attention building blocks (average and max channel; average and max spatial), on which the SoftMax and sigmoid operators are applied. Extensive experiments on various benchmarks demonstrate the effectiveness of FSRFNet, from large models to small models and from large datasets to small datasets.
Author Contributions: X.M. contributed to the paper in conceptualization, methodology, formal analysis, software, visualization, data curation, and review and editing. Z.Y. contributed to the paper in conceptualization, methodology, investigation, and original draft preparation.