A Lightweight Convolutional Neural Network Based on Hierarchical-Wise Convolution Fusion for Remote-Sensing Scene Image Classiﬁcation

Abstract: The large intra-class difference and inter-class similarity of scene images bring great challenges to remote-sensing scene image classification. In recent years, many remote-sensing scene classification methods based on convolutional neural networks have been proposed. In order to improve classification performance, many studies increase the width and depth of the convolutional neural network to extract richer features, which increases the complexity of the model and reduces its running speed. To solve this problem, a lightweight convolutional neural network based on hierarchical-wise convolution fusion (LCNN-HWCF) is proposed for remote-sensing scene image classification. Firstly, in the shallow layers of the neural network (groups 1–3), the proposed lightweight dimension-wise convolution (DWC) is utilized to extract the shallow features of remote-sensing images. Dimension-wise convolution is carried out in the three dimensions of length, width and channel, and the convoluted features of the three dimensions are then fused. Compared with traditional convolution, dimension-wise convolution requires far fewer parameters and computations. In the deep layers of the neural network (groups 4–7), the running speed of the network usually decreases due to the increase in the number of filters. Therefore, the hierarchical-wise convolution fusion module is designed to extract the deep features of remote-sensing images. Finally, the global average pooling layer, the fully connected layer and the Softmax function are used for classification. Using global average pooling before the fully connected layer better preserves the spatial information of the features. The proposed method achieves good classification results on the UCM, RSSCN7, AID and NWPU datasets. The classification accuracy of the proposed LCNN-HWCF on the AID dataset (training:test = 2:8) and the NWPU dataset (training:test = 1:9), which are of great classification difficulty, reaches 95.76% and 94.53%, respectively. A series of experimental results shows that, compared with some state-of-the-art classification methods, the proposed method not only greatly reduces the number of network parameters but also maintains classification accuracy, achieving a good trade-off between classification accuracy and running speed.


Introduction
Remote sensing is a technique that uses sensors to measure the electromagnetic radiation reflected or emitted by objects in order to perceive and analyze some of their features. In recent years, with the continuous development of remote-sensing technology, the resolution of remote-sensing images obtained from remote-sensing satellites has been continuously improving. Fine-scale information can be obtained from high-resolution remote-sensing images, which makes remote-sensing images widely used in many fields [1][2][3][4][5][6].
As shown in Figure 1, compared with general images, remote-sensing scene images contain richer, more detailed and more complex ground objects. A remote-sensing image with a specific scene label usually contains multiple object labels. In Figure 1a, the scene label is 'River', and the object labels include 'Forest', 'Residential', etc. In Figure 1b, the scene label is 'Overpass', and the object labels include 'Parking', 'Rivers', 'Buildings', etc. Object labels can cause confusion in the classification of scene labels, resulting in classification errors. In addition, as shown in Figure 2, the intra-class difference and inter-class similarity of remote-sensing images also bring great challenges to the correct classification of remote-sensing scene images. In Figure 2a, within the same 'airplane' scene, the size, shape and background pattern of the airplanes are different. In Figure 2b, the different scenes 'highway', 'railway' and 'runway' have similar texture features. These characteristics of remote-sensing scene images make classification difficult.

As a powerful image analysis tool, convolutional neural networks have achieved great success in the field of image classification. For example, the MobileNet [7], VGGNet [8], ResNet [9] and DenseNet [10] models have achieved impressive results in different visual tasks, such as image classification and target detection. Subsequently, a series of remote-sensing scene classification methods based on convolutional neural networks were proposed. Zeng et al. [11] proposed a new end-to-end convolutional neural network that integrates global context features and local object features, which makes the method more discriminative in scene classification. Wang et al. [12] proposed a multi-level feature fusion structure for remote-sensing scene classification based on global context information. In order to reduce the complexity of traditional convolutional neural networks, Shi et al. [13] proposed a lightweight convolutional neural network based on attention multi-branch feature fusion, which improves the classification performance of the network through the combination of an attention mechanism and hybrid convolution. Liu et al.
[14] proposed a two-stage deep feature fusion convolutional neural network, which adaptively integrates the feature information of the middle layers and the fully connected layer to make full use of the abundant information in the shallow layers, effectively improving the classification performance of the network.
In this paper, a lightweight convolutional neural network based on hierarchical-wise convolution fusion (LCNN-HWCF) is proposed. Firstly, a lightweight dimension-wise convolution is designed. Dimension-wise convolution is carried out in the three dimensions of width, length and channel, respectively, and then the convoluted features of the three dimensions are fused. In the shallow layers of the network (groups 1-3), the combination of continuous dimension-wise convolution and max pooling is utilized to extract remote-sensing image features. With the deepening of the network, the size of the feature map decreases, and the number of filters increases. The increase in the number of filters brings a large number of parameters and calculations, resulting in a decrease in the running speed of the whole network. In order to further reduce the complexity of the network and improve its running speed, we propose a hierarchical-wise convolution fusion module for the deep layers of the network (groups 4-7). In the classifier (group 8), the features output by the last convolution layer are successively passed through the global average pooling layer, the fully connected layer and the Softmax classifier to generate the probability of each scene category. Using global average pooling before the fully connected layer preserves the spatial information of the features effectively.
The main contributions of this paper are as follows:
(1) A new lightweight dimension-wise convolution is proposed. Dimension-wise convolution is carried out along the three dimensions of width, length and channel, respectively, and then the convoluted features of the three dimensions are fused. Compared with traditional convolution, dimension-wise convolution significantly reduces the number of parameters and computations and has stronger feature extraction ability.
(2) A hierarchical-wise convolution fusion module is designed. The module first groups the input along the channel dimension and directly maps the first group of features to the next layer. The second group of features is processed by dimension-wise convolution, and the output features are divided into two parts: one is mapped to the next layer, and the other is concatenated with the next group of features. The concatenated features are again processed by dimension-wise convolution. This operation is repeated until all groups are processed.
(3) In the classification phase, a combination of global average pooling, a fully connected layer and Softmax is adopted to convert the input features into the probability of each category. Using global average pooling before the fully connected layer preserves the spatial information of features as much as possible.
(4) A lightweight convolutional neural network is constructed using dimension-wise convolution, the hierarchical-wise convolution fusion module and the classifier. The superiority of the proposed method is proven by a series of experiments.
The rest of this paper is organized as follows. In Section 2, the related work is introduced. In Section 3, the dimension-wise convolution, the hierarchical-wise convolution fusion module, the classification module and the LCNN-HWCF method are introduced in detail. In Section 4, the proposed LCNN-HWCF method is compared with some state-of-the-art methods. In Section 5, the proposed dimension-wise convolution and the traditional convolution are compared through ablation experiments. Section 6 gives the conclusions.

Related Work
Recently, for lightweight convolutional neural networks, traditional convolution has been replaced by various convolution variants with great success. In this paper, the traditional three-dimensional convolution is split, and a dimension-wise convolution is proposed whose numbers of parameters and computations are much lower than those of traditional convolution. In addition, in order to design a lightweight network, a hierarchical-wise convolution fusion module is proposed to extract the deep and complex features of scene images. The hierarchical-wise convolution fusion module is an improved form of group convolution. Before introducing the proposed method, we first review the related work, including convolution variant structures and group convolution.

Convolution Variant Structure
Singh et al. [15] designed heterogeneous convolution from the perspective of optimizing the convolution structure. In heterogeneous convolution, some channels of the input features use a k × k convolution kernel, while the remaining channels use a 1 × 1 convolution kernel, and a hyperparameter p controls the proportion of k × k kernels. Chen et al. [16] proposed dynamic convolution, which does not increase the depth or width of the network. Dynamic convolution dynamically aggregates multiple parallel small-size convolution kernels according to attention; these kernels are aggregated in a nonlinear way, which provides stronger feature representation ability and higher computational efficiency. Different from traditional convolution, which uses small convolution kernels to fuse spatial and channel information, Liu et al. [17] proposed self-calibrated convolution, which can adaptively establish long-distance spatial and channel dependencies around each spatial location through a self-calibration operation, so as to generate more discriminative features and extract richer context information. Chen et al. [18] proposed octave convolution from the perspective of frequency. Octave convolution divides the input features into high-frequency and low-frequency features along the channel dimension, and the ratio of high to low frequency is controlled by a hyperparameter α. In octave convolution, the spatial resolution of the low-frequency features is reduced by half, which effectively reduces the number of parameters and calculations. Octave convolution effectively improves the representation ability of features and promotes information fusion through the interaction between high-frequency and low-frequency information. In the deep layers of a network, the increase in the number of filters not only brings a huge number of parameters and calculations but also produces a lot of redundant information. Han et al. [19] addressed the redundant information generated by traditional convolution and proposed ghost convolution. Ghost convolution extracts rich feature information through a traditional convolution operation and uses linear transformations to generate the redundant feature maps, which effectively reduces the computational complexity of the model. The convolution parameters of traditional convolution are shared by all samples: once the convolution parameters are determined, the features of any test sample are extracted using the same fixed parameters. Yang et al. [20] proposed conditionally parameterized convolutions, which obtain a customized convolution kernel for each input sample in each batch, improving the model capacity while maintaining efficient running speed. Cao et al. [21] combined depth-wise convolution with traditional convolution and proposed depth-wise over-parameterized convolution, which first applies depth-wise convolution to the input features and then applies traditional convolution to the intermediate results.

Group Convolution
AlexNet, proposed in 2012, was the first to adopt group convolution. Due to the limitations of the hardware at that time, Krizhevsky et al. [22] used multiple GPUs for training; each GPU completed part of the convolution, and the convolution results of the GPUs were finally fused. Xie et al. [23] improved ResNet by using the idea of group convolution and proposed ResNeXt. ShuffleNet, proposed by Zhang et al. [24], is another generalization of group convolution. ShuffleNet introduced the channel shuffling operation on top of group convolution, which solves the lack of information interaction between different groups caused by group convolution. Wu et al. [25] grouped the input along the channel dimension and then recalibrated each group of features using channel attention; the combination of channel attention and group convolution effectively improves the feature representation ability of the network. Liu et al. [26] proposed a lightweight hybrid group convolutional neural network, which adopts traditional convolution and dilated convolution in different groups and exchanges information between the fused features through the channel shuffling operation to improve the performance of the network. Shen et al. [27] adopted a group attention fusion strategy to improve network classification performance.

The Overall Structure of the Proposed LCNN-HWCF Method
The overall structure of the proposed LCNN-HWCF method is shown in Figure 3, which is composed of 8 parts (groups 1-8). Groups 1 to 3 are used to extract shallow features of remote-sensing images. Each of these three groups is composed of two continuous dimension-wise convolutions followed by max pooling. Dimension-wise convolution is designed to extract features from three dimensions: length, width and channel. Max pooling is utilized to downsample the convoluted features, reducing the number of parameters and computations while preserving the main features and avoiding overfitting. Groups 4 to 7 adopt four hierarchical-wise convolution fusion modules to extract the deep features of scene images; they correspond to hierarchical-wise convolution fusion modules A to D in Section 3.3, respectively. The structure of group 8 is shown in Figure 4.

Group 8 is composed of the global average pooling layer (GAP), the fully connected layer (FC) and the Softmax classifier, which converts the feature information extracted by convolution into the probability of each scene category. Because the features extracted by convolution contain spatial information, directly mapping them into feature vectors through the fully connected layer would destroy this spatial information, whereas global average pooling preserves it. Therefore, global average pooling is carried out first, followed by the fully connected layer. Suppose that the output of the last convolution layer is $E = [e_1, e_2, \cdots, e_n] \in \mathbb{R}^{H \times W \times n}$, where $\mathbb{R}$ represents the real number set and $H$, $W$ and $n$ represent the length, width and number of channels of the input data, respectively. If the output of global average pooling is $o = (o_1, o_2, \cdots, o_n) \in \mathbb{R}^{1 \times 1 \times n}$, the output $o_i$ of global average pooling for $\forall e_i$ can be represented as

$$ o_i = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} e_i(h, w) \quad (1) $$

As can be seen from Formula (1), global average pooling maps the features output by the last convolution layer to each category more intuitively. Then, the weight matrix $w_i^T$ of the fully connected layer multiplies the global average pooling output vector $o = (o_1, o_2, \cdots, o_n)$ to obtain the vector $V = (\upsilon_1, \upsilon_2, \cdots, \upsilon_K)$, which is called the score vector, where $K$ is the number of categories. Finally, the non-normalized $K$-dimensional score vector $V$ is mapped to the normalized $K$-dimensional probability vector $P = (\rho_1, \rho_2, \cdots, \rho_K)$ using the Softmax function:

$$ \rho_i = \frac{\exp(\upsilon_i)}{\sum_{k=1}^{K} \exp(\upsilon_k)} \quad (2) $$

In this paper, the cross-entropy loss is adopted as the loss function $l$, which can be represented as

$$ l = -\sum_{i=1}^{K} y_i \log \rho_i \quad (3) $$

In Formula (3), $\rho_i$ represents the output of the Softmax function, $K$ is the number of categories, and $Y = [y_1, y_2, \cdots, y_K]$ represents the one-hot encoding of the input sample label: $y_i$ is 1 if the predicted category $i$ is the category marked by the sample and 0 otherwise. Therefore, for a sample whose true category index is $t$, the loss function can be further written as

$$ l = -\log \rho_t \quad (4) $$
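To make the group 8 classifier concrete, the following Keras snippet gives a minimal sketch of the GAP-FC-Softmax head and the cross-entropy loss of Formulas (1)-(4); the function name classifier_head and the unspecified number of classes are illustrative assumptions rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def classifier_head(features, num_classes):
    """Minimal sketch of the group 8 classifier: global average pooling
    (Formula (1)) before the fully connected layer, followed by Softmax
    (Formula (2)). `features` is the output of the last convolution layer."""
    o = layers.GlobalAveragePooling2D()(features)   # o_i = mean over H x W of e_i
    v = layers.Dense(num_classes)(o)                 # score vector V = (v_1, ..., v_K)
    return layers.Softmax()(v)                       # probability vector P

# Cross-entropy loss of Formulas (3)-(4) for one-hot encoded labels.
loss_fn = tf.keras.losses.CategoricalCrossentropy()
```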

Dimension-Wise Convolution
For convolutional neural networks, convolution is used to extract the features of the input image and generate feature maps. With the deepening of the network, the size of the feature map gradually decreases, and more representative features are extracted. It is necessary to increase the number of convolution kernels to extract sufficient features from the previous layer. However, as the number of convolution kernels increases, the number of parameters and the amount of computation also gradually increase. In order to solve this problem, a dimension-wise convolution is proposed. The structures of the traditional convolution and the dimension-wise convolution are shown in Figure 5.
As shown in Figure 5b, the dimension-wise convolution adopts three convolution kernels of different scales, $W_X \in \mathbb{R}^{X \times 1 \times 1}$, $W_Y \in \mathbb{R}^{1 \times Y \times 1}$ and $W_C \in \mathbb{R}^{1 \times 1 \times C_{out}}$, to convolute along the length, width and channel directions, respectively, and then fuses the features of the three directional convolutions. Compared with traditional convolution, the proposed dimension-wise convolution has obvious advantages in the number of parameters and the amount of computation, which is analyzed as follows.

First, the number of parameters is analyzed. Suppose the input feature is $F \in \mathbb{R}^{X_1 \times Y_1 \times C_{in}}$, where $X_1$, $Y_1$ and $C_{in}$ represent the length, width and number of channels of the input feature, respectively. As shown in Figure 5a, the convolution kernels of traditional convolution are $W \in \mathbb{R}^{C_{in} \times X \times Y \times C_{out}}$, where $X$ and $Y$ represent the length and width of the convolution kernel, $C_{in}$ represents the number of input channels of the convolution kernel, and $C_{out}$ represents the number of convolution kernels, i.e., the number of output channels. The output feature after convolution with stride 1 is $S = W * F \in \mathbb{R}^{X_1 \times Y_1 \times C_{out}}$. The number of parameters of traditional convolution is therefore $X \cdot Y \cdot C_{in} \cdot C_{out}$. The dimension-wise convolution convolutes along the length, width and channel directions, respectively, and then fuses the features of the three directional convolutions. The convolution kernel along the channel direction is $W_C \in \mathbb{R}^{1 \times 1 \times C_{out}}$, along the width direction is $W_Y \in \mathbb{R}^{1 \times Y \times 1}$, and along the length direction is $W_X \in \mathbb{R}^{X \times 1 \times 1}$. The number of parameters of the dimension-wise convolution can be calculated as follows.
The feature after channel-wise convolution is $S_C = W_C * F \in \mathbb{R}^{X_1 \times Y_1 \times C_{out}}$, and the number of parameters is $C_{in} \cdot C_{out}$; the feature after width-wise convolution is $S_Y = W_Y * F \in \mathbb{R}^{X_1 \times Y_1 \times 1}$, and the number of parameters is $C_{in} \cdot Y$; the feature after length-wise convolution is $S_X = W_X * F \in \mathbb{R}^{X_1 \times Y_1 \times 1}$, and the number of parameters is $C_{in} \cdot X$. The total number of parameters in the three dimensions is $(X + Y + C_{out}) \cdot C_{in}$. Because the number of output channels is much larger than the length and width of the convolution kernel, i.e., $C_{out} \gg X$ and $C_{out} \gg Y$, the total number of parameters of the dimension-wise convolution is approximately $C_{in} \cdot C_{out}$, which is about $1/(XY)$ times that of the traditional convolution.
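As a quick numerical illustration of this analysis, the snippet below compares the two parameter counts for an assumed 3 × 3 kernel with 64 input and 128 output channels; these values are chosen for illustration only and are not taken from the paper.

```python
# Hypothetical numeric check of the parameter-count analysis above.
X, Y, C_in, C_out = 3, 3, 64, 128
traditional = X * Y * C_in * C_out            # 73,728 parameters
dimension_wise = (X + Y + C_out) * C_in       # 8,576 parameters
print(traditional / dimension_wise)           # ~8.6, close to X*Y = 9
```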
Following that, the computational complexity is analyzed. The computational complexity of traditional convolution is $X_1 \cdot Y_1 \cdot X \cdot Y \cdot C_{in} \cdot C_{out}$. The computational complexity of dimension-wise convolution is as follows.
The feature after channel-wise convolution is $S_C = W_C * F \in \mathbb{R}^{X_1 \times Y_1 \times C_{out}}$, and its computational complexity is $X_1 \cdot Y_1 \cdot C_{in} \cdot C_{out}$; the feature after width-wise convolution is $S_Y = W_Y * F \in \mathbb{R}^{X_1 \times Y_1 \times 1}$, and its computational complexity is $X_1 \cdot Y_1 \cdot Y \cdot C_{in}$; the feature after length-wise convolution is $S_X = W_X * F \in \mathbb{R}^{X_1 \times Y_1 \times 1}$, and its computational complexity is $X_1 \cdot Y_1 \cdot X \cdot C_{in}$. The total computational complexity of the dimension-wise convolution is $X_1 \cdot Y_1 \cdot (X + Y + C_{out}) \cdot C_{in}$, which is approximately $1/(XY)$ times that of the traditional convolution because $C_{out} \gg X$ and $C_{out} \gg Y$. This effectively reduces the computational complexity of convolution.
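To make the construction concrete, the following Keras sketch implements one possible reading of dimension-wise convolution. The fusion of the three directional responses by broadcast addition, the use of batch normalization and ReLU, and the absence of biases are assumptions not specified in the text above, so this is an illustrative sketch rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dimension_wise_conv(x, c_out, k=3):
    """Minimal sketch of dimension-wise convolution (DWC). The three
    directional responses are fused here by broadcast addition, which is
    an interpretation choice."""
    s_c = layers.Conv2D(c_out, (1, 1), padding='same', use_bias=False)(x)  # channel-wise, C_in*C_out params
    s_x = layers.Conv2D(1, (k, 1), padding='same', use_bias=False)(x)      # length-wise, C_in*X params
    s_y = layers.Conv2D(1, (1, k), padding='same', use_bias=False)(x)      # width-wise, C_in*Y params
    out = s_c + s_x + s_y            # single-channel maps broadcast over the C_out channels
    return layers.ReLU()(layers.BatchNormalization()(out))

# Example: a 64-channel input feature map mapped to 128 output channels.
inp = tf.keras.Input(shape=(64, 64, 64))
model = tf.keras.Model(inp, dimension_wise_conv(inp, 128))
```

With this layout, the trainable weights of the three branches match the counts derived above (C_in·C_out, C_in·X and C_in·Y, respectively).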

Hierarchical-Wise Convolution Fusion Module
With the deepening of the convolutional neural network, the number of filters increases gradually. Although the increase in filters helps extract more significant features, it also brings a large number of parameters and computations, resulting in a reduction in the running speed of the network. In order to solve this problem, a hierarchical-wise convolution fusion module is proposed to extract deeper features. Hierarchical-wise convolution divides the input feature $X \in \mathbb{R}^{H \times W \times C}$ along the channel dimension into 4 groups $[x_1, x_2, x_3, x_4]$, so that the number of channels in each group is $C/4$. In each module, one of the groups is directly mapped to the next layer, while the remaining groups extract features by dimension-wise convolution. The output features of each convolution are divided into two branches: one is mapped to the next layer, and the other is concatenated with the input features of the next group. Channel concatenation is utilized to fuse different features and enhance information interaction among different groups. After concatenation, the features are again extracted by dimension-wise convolution. This process is repeated until all input groups are processed. The four hierarchical-wise convolution fusion modules are shown in Figures 6-9, respectively. The specific process is as follows.
In the hierarchical-wise convolution fusion module A, which is shown in Figure 6, only the first group of input features $x_1$ is directly mapped to the next layer. The second group of input features $x_2$ first extracts features by dimension-wise convolution (DWC) to obtain $x_2 * f$, and the output is then divided into two routes: one is mapped to the next layer, and the other is concatenated with the third group of features $x_3$ along the channel dimension to obtain $(x_2 * f) \oplus x_3$. After concatenation, the features $[(x_2 * f) \oplus x_3] * f$ are extracted by dimension-wise convolution and divided into two routes again: one is mapped to the next layer, and the other is concatenated with the fourth group of features $x_4$ to obtain $\{[(x_2 * f) \oplus x_3] * f\} \oplus x_4$. After concatenation, the features are extracted by dimension-wise convolution and mapped to the next layer. The whole process can be represented as

$$ y_A = x_1 \oplus (x_2 * f) \oplus \{[(x_2 * f) \oplus x_3] * f\} \oplus \big(\{[(x_2 * f) \oplus x_3] * f\} \oplus x_4\big) * f \quad (5) $$

In the hierarchical-wise convolution fusion module B, which is shown in Figure 7, only the second group of input features $x_2$ is directly mapped to the next layer. The first group of input features $x_1$ first extracts features by dimension-wise convolution to obtain $x_1 * f$ and then divides the output into two routes: one is mapped to the next layer, and the other is concatenated with the third group of input features $x_3$ to obtain $x_3 \oplus (x_1 * f)$. After concatenation, the features $[x_3 \oplus (x_1 * f)] * f$ are extracted by dimension-wise convolution and divided into two routes: one is mapped to the next layer, and the other is concatenated with the fourth group of features $x_4$ to obtain $\{[x_3 \oplus (x_1 * f)] * f\} \oplus x_4$. After concatenation, the features are extracted by dimension-wise convolution and mapped to the next layer. The whole process can be represented as

$$ y_B = (x_1 * f) \oplus x_2 \oplus \{[x_3 \oplus (x_1 * f)] * f\} \oplus \big(\{[x_3 \oplus (x_1 * f)] * f\} \oplus x_4\big) * f \quad (6) $$

In the hierarchical-wise convolution fusion module C, which is shown in Figure 8, only the third group of input features $x_3$ is directly mapped to the next layer. The first group of input features $x_1$ first extracts features by dimension-wise convolution to obtain $x_1 * f$ and then divides the output into two routes: one is mapped to the next layer, and the other is concatenated with the second group of input features $x_2$ to obtain $x_2 \oplus (x_1 * f)$. After concatenation, the features $[x_2 \oplus (x_1 * f)] * f$ are extracted by dimension-wise convolution and divided into two routes: one is mapped to the next layer, and the other is concatenated with the fourth group of features $x_4$ to obtain $\{[x_2 \oplus (x_1 * f)] * f\} \oplus x_4$. After concatenation, the features are extracted by dimension-wise convolution and mapped to the next layer. The whole process can be represented as

$$ y_C = (x_1 * f) \oplus \{[x_2 \oplus (x_1 * f)] * f\} \oplus x_3 \oplus \big(\{[x_2 \oplus (x_1 * f)] * f\} \oplus x_4\big) * f \quad (7) $$

In the hierarchical-wise convolution fusion module D, which is shown in Figure 9, only the fourth group of input features $x_4$ is directly mapped to the next layer. The first group of input features $x_1$ first extracts features by dimension-wise convolution to obtain $x_1 * f$ and then divides the output into two routes: one is mapped to the next layer, and the other is concatenated with the second group of input features $x_2$ to obtain $x_2 \oplus (x_1 * f)$. After concatenation, the features $[x_2 \oplus (x_1 * f)] * f$ are extracted by dimension-wise convolution and divided into two routes: one is mapped to the next layer, and the other is concatenated with the third group of features $x_3$ to obtain $\{[x_2 \oplus (x_1 * f)] * f\} \oplus x_3$. After concatenation, the features are extracted by dimension-wise convolution and mapped to the next layer. The whole process can be represented as

$$ y_D = (x_1 * f) \oplus \{[x_2 \oplus (x_1 * f)] * f\} \oplus \big(\{[x_2 \oplus (x_1 * f)] * f\} \oplus x_3\big) * f \oplus x_4 \quad (8) $$

In Formulas (5)-(8), $*$ represents the convolution operation, $f$ represents dimension-wise convolution, and $\oplus$ represents the channel concatenation operation; the nested brackets $(\cdot)$, $[\cdot]$ and $\{\cdot\}$ indicate that the enclosed operations are performed successively from the innermost to the outermost.
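A compact sketch of module A, reusing the dimension_wise_conv sketch from the previous section, may help clarify the data flow. Treating dwc as a generic callable and re-joining the four routes by channel concatenation are interpretation choices based on the description above, not a verified reimplementation of the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def hwcf_module_a(x, dwc):
    """Sketch of hierarchical-wise convolution fusion module A (Figure 6).
    `dwc` is any callable applying dimension-wise convolution; modules B-D
    differ only in which group bypasses the convolutions."""
    x1, x2, x3, x4 = tf.split(x, num_or_size_splits=4, axis=-1)  # C/4 channels each
    y1 = x1                                           # route 1: identity mapping
    y2 = dwc(x2)                                      # route 2: x2 * f
    y3 = dwc(layers.Concatenate()([y2, x3]))          # route 3: [(x2*f) + x3] * f
    y4 = dwc(layers.Concatenate()([y3, x4]))          # route 4: ({...} + x4) * f
    return layers.Concatenate()([y1, y2, y3, y4])     # all routes mapped to the next layer

# Example usage (channel bookkeeping is an assumption): keep each route at C/4 channels.
x = tf.keras.Input(shape=(32, 32, 128))
y = hwcf_module_a(x, dwc=lambda t: dimension_wise_conv(t, 128 // 4))
```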

Dataset Settings
The remote-sensing image dataset AID was published by Xia et al. [28] of Wuhan University and Huazhong University of Science and Technology in 2017. The AID dataset has 10,000 images and 30 different scene categories, including 'airport', 'bridge', 'bareLand', etc. There are 220 to 420 remote-sensing images in each scene category, and the size of each remote-sensing image is approximately 600 × 600 pixels. The spatial resolution of the AID dataset is 0.5 m to 8 m.

The RSSCN7 dataset was published in 2015 by Zou et al. [29] of Wuhan University. Scene images captured in different seasons and weather pose a major challenge to classification. The RSSCN7 dataset has 2800 images and seven different scene categories, including 'grass', 'forest', 'field', 'parking', 'resident', 'industry' and 'riverlake'. Each scene category contains 400 scene images, with 400 × 400 pixels per scene image.

The UCM dataset was published by Yang et al. [30] in 2010. The dataset has 2100 images and 21 different scene categories, including 'agricultural', 'airplane', 'forest', etc. Each scene category contains 100 scene images, each with 256 × 256 pixels. The spatial resolution of the UCM dataset is 0.3 m.

The NWPU45 dataset was published by Cheng et al. [31] of Northwestern Polytechnical University in 2017. The NWPU45 dataset has 31,500 images and 45 different scene categories, including 'airplane', 'airport', 'baseball', etc. Each scene category contains 700 scene images, with 256 × 256 pixels per scene image. The spatial resolution of the NWPU45 dataset is 0.2 m to 30 m.

Setting of the Experiments
Our experiments are based on the Keras framework and implemented on a computer with an NVIDIA GeForce RTX 2060 GPU. To prevent memory overflow during training, the input images are cropped to 256 × 256 pixels. In order to make the network model converge more stably, the momentum optimizer is utilized for network training, with the momentum factor set to 0.9.
In the experiments, the size of the dimension-wise convolution kernel along the length (X) direction is 3 × 1 × 1, and the size of the convolution kernel along the width (Y) direction is 1 × 3 × 1. With the deepening of the network, the size of the convolution kernel along the channel (C) direction gradually increases from group 1 to group 7.
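A minimal Keras training configuration matching the settings described above might look as follows. The placeholder model, the learning rate and the number of classes are illustrative assumptions, while the 256 × 256 input size and the momentum factor of 0.9 come from the text.

```python
import tensorflow as tf

# `model` stands in for the LCNN-HWCF network; a trivial placeholder is used
# here so the snippet runs on its own.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256, 256, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(30, activation='softmax'),  # e.g., 30 AID categories
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # momentum factor 0.9
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
```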


Performance of the Proposed LCNN-HWCF Method
The OA, AA, F1 and kappa coefficients of the proposed LCNN-HWCF method on the four datasets under different training ratios are listed in Table 1. The overall accuracy (OA) is the ratio of correctly predicted samples to the total number of samples in the test set, and the average accuracy (AA) is the mean of the per-class accuracies, i.e., the ratio of correctly predicted samples of each class to the number of samples in that class, averaged over all classes. F1 is the weighted harmonic mean of precision and recall, which is used to measure the overall performance of the proposed method. The kappa coefficient measures the agreement between the predicted results and the ground-truth labels. It can be seen from Table 1 that the proposed LCNN-HWCF method achieves good performance on the four datasets with different training ratios.
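For reference, the four metrics can be computed with scikit-learn as in the hedged sketch below, assuming integer class labels y_true and y_pred for the test set; this is a generic formulation, not the authors' evaluation code.

```python
from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             cohen_kappa_score)

def evaluate(y_true, y_pred):
    """Compute OA, AA, weighted F1 and the kappa coefficient."""
    oa = accuracy_score(y_true, y_pred)                  # overall accuracy
    aa = recall_score(y_true, y_pred, average='macro')   # mean per-class accuracy
    f1 = f1_score(y_true, y_pred, average='weighted')    # weighted F1
    kappa = cohen_kappa_score(y_true, y_pred)            # kappa coefficient
    return oa, aa, f1, kappa
```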

Experimental Results on UCM Dataset
The comparison of experimental results between the proposed method and some state-of-the-art methods on the UCM dataset with a training ratio of 80% is shown in Table 2. When the training proportion is 80%, the overall accuracy of the lightweight method LCNN-BFF [32] is 99.29%, which is 0.24% higher than that of the Scale-Free CNN [33], Inceptionv3+CapsNet [34] and DDRL-AM [35] methods but still 0.24% lower than that of our method, and the parameters of the proposed method are only 9.6% of those of LCNN-BFF. This proves that the proposed method achieves a good trade-off between classification accuracy and the number of parameters. Table 2. Comparison of OA and parameters between the proposed method and some advanced methods on the UCM dataset with 80% training ratio.


Experimental Results on RSSCN7 Dataset
The comparison of experimental results between the proposed method and some advanced methods on the RSSCN dataset with a training ratio of 50% is listed in Table 3. Because the scene images in the RSSCN dataset come from different seasons and weather, the classification of this dataset is challenging. As shown in Table 3, when the training proportion is 50%, the classification accuracy of the proposed method is 97.65%, which is 2.44% higher than that of ADFF [38], 2.11% higher than that of Contourlet CNN [52] and 2.94% higher than that of SE-MDPMNet [53]. Compared with the lightweight methods LCNN-BFF [32] and SE-MDPMNet [53], the parameters of the proposed method are only 9.6% and 11.6% of theirs, respectively.

Table 3. Comparison of OA and parameters between the proposed method and some advanced methods on the RSSCN dataset with 50% training ratio.

Method | OA (%) | Parameters
[36] | 89.1 | 32 M
TSDFF Method [14] | 92.37 ± 0.72 | 50 M
ResNet+SPM-CRC Method [37] | 93.86 | 23 M
ResNet+WSPM-CRC Method [37] | 93.9 | 23 M
LCNN-BFF Method [32] | 94.64 ± 0.21 | 6.2 M
ADFF [38] | 95.21 ± 0.50 | 23 M
Contourlet CNN [52] | 95.54 ± 0.17 | 12.6 M
SE-MDPMNet [53] | 94.71 ± 0.15 | 5.17 M
Proposed | 97.65 ± 0.12 | 0.6 M

The confusion matrix of the proposed method on the RSSCN dataset is shown in Figure 11. Although the proposed method does not achieve full recognition of any scene category in this dataset, the accuracy for all scene categories is over 97%, and the proposed method still achieves good results.


Experimental Results on NWPU Dataset
Compared with the UCM, RSSCN and AID datasets, the NWPU dataset has many more remote-sensing scene images, which brings great challenges to classification on this dataset. The comparison of experimental results between the proposed method and some advanced methods on the NWPU dataset with 10% and 20% training ratios is shown in Table 5. When the training proportion is 10%, the classification accuracy of the proposed method is 93.10%, which is 6.57%, 8.77%, 6.87% and 2.87% higher than that of LCNN-BFF [32], Skip-Connected CNN [42], ResNet50 [54] and LiG with RBF kernel [56], respectively. When the training proportion is 20%, the classification accuracy of the proposed method is 94.53%, which is 2.8%, 7.23% and 1.28% higher than that of the lightweight methods LCNN-BFF [32], Skip-Connected CNN [42] and LiG with RBF kernel [56], respectively, and the number of parameters of the proposed method is only 9.6%, 10% and 28.9% of theirs. These results further verify the validity of the proposed method. Table 5. Comparison of OA and parameters between the proposed method and some advanced methods on the NWPU dataset with 10% and 20% training ratios.
Method | OA (10% training, %) | OA (20% training, %) | Parameters
InceptionV3 [54] | 85.46 ± 0.33 | 87.75 ± 0.43 | 45.37 M
Contourlet CNN [52] | 85.93 ± 0.51 | 89.57 ± 0.45 | 12.6 M
LiG with RBF kernel [56] | 90.23 | – | –
Proposed | 93.10 | 94.53 | 0.6 M

The confusion matrix of the proposed method on the NWPU dataset with 20% training ratio is shown in Figure 13. As can be seen from Figure 13, the proposed method achieves a classification accuracy of more than 90% for all scenes in the NWPU dataset. For the most confusing scene categories, namely 'palace' and 'church', which have similar building shapes, the classification accuracy of the proposed method is still 93% and 92%, respectively, which proves the robustness of the proposed method.


Model Complexity Analysis
To further verify the advantages of the proposed method, CaffeNet [51], VGG-VD-16 [51], GoogLeNet [51], MobileNetV2 [53], SE-MDPMNet [53], LCNN-BFF [32] and the proposed method are compared on the AID dataset with training:test = 5:5. OA, the number of parameters and floating point operations (FLOPs) are adopted as the evaluation indexes, where FLOPs are utilized to measure the computational complexity of the model; a smaller FLOPs value indicates lower complexity. The experimental results are shown in Table 6. It can be seen from Table 6 that on the AID dataset with training:test = 5:5, with the proposed dimension-wise convolution and hierarchical-wise convolution fusion module, the number of parameters and the FLOPs value of the proposed method are only 0.6 M and 1.7 M, respectively, and the classification accuracy reaches 97.43%. Compared with the lightweight networks MobileNetV2 [53] and SE-MDPMNet [53], the proposed method reduces the complexity of the model while guaranteeing high classification accuracy, further proving the effectiveness of the method.

Model Running Speed Comparison
In order to verify the advantages of the proposed lightweight dimension-wise convolution in terms of running speed, running-time experiments are conducted on the UCM dataset. In these experiments, the average training time (ATT) is used as the evaluation index; ATT represents the average time required for the model to process one image during training. Gated Bidirectional + global feature [40], Gated Bidirectional [40], Siamese ResNet50 [48], Siamese AlexNet [48], Siamese VGG-16 [48] and LCNN-BFF [32] are selected for comparison, and all experiments use the same settings. The experimental results are shown in Table 7. It can be seen from Table 7 that the ATT of the proposed method is the smallest, only 0.015 s, which is 0.014 s smaller than that of the LCNN-BFF [32] method and 0.013 s smaller than that of the Siamese AlexNet [48] method. The experimental results show that the proposed method has a faster running speed. Table 7. Model running speed comparison on the UCM dataset.
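A simple way to estimate ATT is sketched below; the train_step callable and the batch size are placeholders, since the measurement procedure is not detailed in the text.

```python
import time

def average_training_time(train_step, num_images, repeats=100):
    """Hypothetical sketch of measuring average training time (ATT) per image.
    `train_step` is any callable that trains the model on one batch of
    `num_images` images (e.g., a wrapper around model.train_on_batch)."""
    start = time.time()
    for _ in range(repeats):
        train_step()
    return (time.time() - start) / (repeats * num_images)  # seconds per image
```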

Visual Analysis
In order to show the advantages of the proposed method more intuitively, a variety of visualization methods are adopted. Firstly, gradient-weighted class activation mapping (Grad-CAM) is utilized to visualize the proposed method. In order to prove that the proposed method can extract the significant features of remote-sensing images effectively, two methods with good classification performance, LCNN-BFF [32] and LCNN-GWHA [55], are chosen for visual comparison on the UCM dataset. The visualization results are shown in Figure 14. The highlighted areas in the class activation maps indicate how much attention the method pays to the scene, with red highlighted areas indicating a higher degree of attention. From Figure 14, it can be seen that compared with LCNN-BFF [32] and LCNN-GWHA [55], the proposed method extracts scene features more effectively and has better semantic coverage of the scene area.

Then, the t-distributed stochastic neighbor embedding (t-SNE) visualization method is adopted to analyze the proposed method on the RSSCN and UCM datasets, respectively. The visualization results are shown in Figure 15. In the t-SNE visualization, data points of different colors represent different scene categories, and data points of the same color gather together to form semantic clusters. As can be seen from Figure 15, the proposed method increases the distance between different semantic clusters, effectively reduces the semantic confusion between similar scenes and improves the classification performance.

Finally, some random image prediction experiments of the proposed method are carried out on the UCM dataset. The experimental results are shown in Figure 16. As can be seen from Figure 16, the confidence of the proposed method for the predicted labels of different scenes is more than 99%, and the predictions agree with the real labels, which proves that the proposed method can effectively extract the features of remote-sensing scene images.
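As an illustration of the t-SNE analysis, a generic scikit-learn sketch is given below; the random features and labels are placeholders standing in for the GAP outputs of the trained model and the corresponding scene categories.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.rand(500, 256)          # placeholder for extracted GAP features
labels = np.random.randint(0, 21, size=500)  # placeholder for scene labels (e.g., 21 UCM classes)

emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap='tab20')
plt.title('t-SNE embedding of scene features')
plt.show()
```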


Discussion
In this section, the advantages of dimension-wise convolution are discussed through four ablation experiments. In ablation experiment 1, the dimension-wise convolution in the shallow layers of the network (groups 1-3) was replaced with traditional convolution, and the other groups (groups 4-8) remained unchanged. In ablation experiment 2, the dimension-wise convolution in the hierarchical-wise convolution fusion modules (groups 4-8) was replaced with traditional convolution, and the other groups (groups 1-3) remained unchanged. In ablation experiment 3, the dimension-wise convolution in the whole network (groups 1-8) was replaced by traditional convolution. In ablation experiment 4, the entire network was kept unchanged. All four ablation experiments used the same experimental equipment and hyperparameter settings. The experimental results on the UCM dataset with training:test = 8:2 are shown in Figures 17 and 18, and the results on the RSSCN dataset with a training ratio of 50% are shown in Figures 19 and 20.

As can be seen from Figures 17-20, on the two datasets, replacing the dimension-wise convolution with traditional convolution reduces the classification performance of the network. In particular, after replacing the dimension-wise convolution in the whole network with traditional convolution, the classification performance decreases most dramatically. As shown in Figure 17, compared with experiment 4, the classification accuracy in experiment 3 decreased by 1.28%, and the kappa value decreased by 1.40% on the UCM dataset with a training ratio of 80%. As shown in Figure 19, compared with experiment 4, the classification accuracy in experiment 3 decreased by 1.18%, and the kappa value decreased by 1.24% on the RSSCN dataset with a training ratio of 50%. In addition, after using traditional convolution, the number of parameters and FLOPs of the network increased to a certain extent. In particular, after replacing all the dimension-wise convolutions in the proposed network with traditional convolution, the number of parameters and the FLOPs value increased considerably. As shown in Figures 18 and 20, on the two datasets, compared with experiment 4, the number of parameters in experiment 3 increased by about 0.86 M, and the FLOPs value increased by about 3.67 M. This series of experiments proves the superiority of the proposed dimension-wise convolution.

Conclusions
In this paper, a lightweight convolutional neural network based on hierarchical-wise convolution fusion (LCNN-HWCF) is proposed for remote-sensing scene image classification. In the shallow layers of the proposed network, features are extracted along the length, width and channel directions, respectively, using dimension-wise convolution. In the deep layers of the network, the hierarchical-wise convolution fusion module is designed to solve the problem of the number of network parameters growing as the network deepens. Finally, under multiple training ratios on the UCM, RSSCN7, AID and NWPU datasets, the proposed method is compared with some advanced methods through a variety of experiments, and the experimental results prove the superiority of LCNN-HWCF. Among them, the classification accuracy of the proposed LCNN-HWCF method on the UCM dataset (training:test = 8:2) reaches 99.53%, almost completely realizing the correct recognition of all scene images.