A Lightweight Convolutional Neural Network Based on Channel Multi-Group Fusion for Remote Sensing Scene Classification

With the development of remote sensing scene image classification, convolutional neural networks have become the most commonly used method in this field owing to their powerful feature extraction ability. To improve classification performance, many studies extract deeper features by increasing the depth and width of convolutional neural networks, which improves accuracy but also increases model complexity. To address this problem, a lightweight convolutional neural network based on channel multi-group fusion (LCNN-CMGF) is presented. In the proposed LCNN-CMGF method, a three-branch downsampling structure is designed to extract shallow features from remote sensing images. In the deep layers of the network, a channel multi-group fusion structure is used to extract the abstract semantic features of remote sensing scene images. This structure alleviates the lack of information exchange between groups caused by group convolution through channel fusion of adjacent features. The four most commonly used remote sensing scene datasets, UCM21, RSSCN7, AID and NWPU45, were used to carry out a variety of experiments in this paper. The experimental results on the four datasets under multiple training ratios show that the proposed LCNN-CMGF method has significant performance advantages over the compared state-of-the-art methods.


Introduction
The goal of remote sensing scene image classification is to correctly classify input remote sensing images. Because remote sensing image classification is widely applied in natural disaster detection, land cover analysis, urban planning and national defense security [1][2][3][4], the classification of remote sensing scene images has attracted extensive attention, and many methods have been proposed to improve its performance. Among them, convolutional neural networks have become one of the most successful deep learning methods owing to their strong feature extraction ability, and are widely used in image classification [5] and object detection [6]. Many excellent neural networks have been designed for image classification. For example, Li et al. [7] proposed a deep feature fusion network for remote sensing scene classification. Zhao et al. [8] proposed a multi-topic framework combining local spectral features, global texture features and local structure features for feature fusion. Wang et al. [9] used an attention mechanism to adaptively select the key parts of each image and then fused the features to generate more representative features.
In recent years, designing convolutional neural networks that achieve an optimal trade-off between classification accuracy and running speed has become a research hotspot. SqueezeNet [10] built a lightweight network from squeeze-and-expand modules. The main contributions of this paper are as follows. (1) A three-branch downsampling structure is designed for shallow feature extraction; the module can fully extract the shallow feature information, so as to accurately distinguish the target scene. (2) In the deep layers of the network, a channel multi-group fusion module is constructed for the extraction of deep features, which divides the input features into features with C/g channels and features with 2C/g channels, increasing the diversity of features. (3) To address the lack of information interaction between groups caused by group convolution, the channel multi-group module fuses the channels of adjacent features to increase information exchange, which significantly improves the performance of the network. The rest of this paper is organized as follows. In Section 2, the overall structure, shallow feature extraction module and channel multi-group module of the proposed LCNN-CMGF method are described in detail. Section 3 provides the experimental results and analysis. In Section 4, several visualization methods are adopted to evaluate the proposed LCNN-CMGF method. The conclusion is given in Section 5.

The Overall Structure of Proposed LCNN-CMGF Methods
As shown in Figure 1, the proposed network structure is divided into eight groups, the first three of which are used to extract shallow information from remote sensing images. Groups 1 and 2 adopt the proposed shallow downsampling structure, which is introduced in detail in Section 2.2. Group 3 uses a hybrid convolution method combining standard convolution and depthwise separable convolution for feature extraction. Depthwise separable convolution requires significantly fewer parameters than standard convolution. Assuming that the input feature size is H × W × C1, the convolution kernel size is H1 × W1 × C1 and the output feature size is H × W × C2, the parameter quantity of standard convolution is

params_conv = H1 × W1 × C1 × C2    (1)

The parameter quantity of depthwise separable convolution is

params_dsc = H1 × W1 × C1 + C1 × C2    (2)

The ratio params_dsc/params_conv of depthwise separable convolution to standard convolution is

params_dsc / params_conv = (H1 × W1 × C1 + C1 × C2) / (H1 × W1 × C1 × C2) = 1/C2 + 1/(H1 × W1)    (3)

According to Equation (3), when the convolution kernel size H1 × W1 is 3 × 3, since C2 >> H1 × W1, the parameter quantity of standard convolution is approximately 9 times that of depthwise separable convolution; when the kernel size H1 × W1 is 5 × 5, it is approximately 25 times. As the kernel size increases, the parameter saving grows further. However, while significantly reducing the number of parameters, depthwise separable convolution inevitably loses some feature information, which degrades the learning ability of the network. Therefore, we propose a hybrid of standard convolution and depthwise separable convolution for feature extraction, which not only reduces the weight parameters but also preserves the learning ability of the network. From group 4 to group 7, the channel multi-group fusion structure is used to further extract deep feature information.
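As a quick check of Equations (1)-(3), the parameter counts can be computed directly. This is a minimal sketch; the layer sizes below are arbitrary examples, not the network's actual configuration:

```python
# Parameter counts from Eqs. (1)-(3): standard vs. depthwise separable convolution.
def conv_params(k, c_in, c_out):
    # standard convolution: k x k x C1 x C2 weights
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    # depthwise (k x k filter per input channel) + pointwise (1 x 1, C1 -> C2)
    return k * k * c_in + c_in * c_out

# example layer: 3x3 kernel, 64 input channels, 128 output channels
k, c_in, c_out = 3, 64, 128
ratio = dsc_params(k, c_in, c_out) / conv_params(k, c_in, c_out)
# the ratio equals 1/C2 + 1/k^2, roughly 1/9 for a 3x3 kernel when C2 >> 9
```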
The channel multi-group fusion structure can generate a large number of features with few parameters, increasing feature diversity. Assuming that the input feature size is H × W × C1, the convolution kernel size is H1 × W1 × C1 and the output feature size is H × W × C2, the parameter quantity of standard convolution is

params_conv = H1 × W1 × C1 × C2    (4)

Dividing the input features into t groups along the channel dimension, the feature size of each group is H × W × C1/t, the corresponding convolution kernel size is H1 × W1 × C1/t, and the output feature size of each group is H × W × C2/t. Concatenating the resulting t groups of features along the channel dimension gives a final output feature size of H × W × C2. The parameter quantity of the whole process is

params_group = t × H1 × W1 × (C1/t) × (C2/t) = H1 × W1 × C1 × C2 / t    (5)

As shown in Equations (4) and (5), the parameter quantity of group convolution is 1/t of that of standard convolution; that is, for the same parameter budget, group convolution produces t times as many features as standard convolution, which increases feature diversity and effectively improves classification accuracy. The details are described in Section 2.3. Group 8 consists of a global average pooling layer (GAP), a fully connected layer (FC), and a softmax classifier, which convert the convolutionally extracted feature information into a probability for each scene class. Features extracted by convolution contain spatial information, which is destroyed if they are mapped directly to a feature vector through a fully connected layer; global average pooling avoids this. Assume that the output of the last convolution layer is O = [o1; o2; . . . ; oi; . . . ; oN] ∈ R^(N×H×W×C), where [ ; ] represents concatenation along the batch dimension and R represents the set of real numbers.
In addition, N, H, W and C represent the number of samples per training batch, the height of the feature, the width of the feature, and the number of channels, respectively. Suppose the result of global average pooling is P = [p1; p2; . . . ; pi; . . . ; pN] ∈ R^(N×1×1×C); then the processing of any pi by the global average pooling layer can be represented as

pi(c) = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} oi(h, w, c),  c = 1, 2, . . . , C    (6)

As shown in Equation (6), global average pooling maps the features of the last convolution layer's output to each class more directly. Additionally, the global average pooling layer requires no weight parameters, which helps avoid overfitting during training. Finally, a softmax classifier outputs the probability value for each scene class.
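Equation (6) can be sketched in a few lines of NumPy, with shapes following the N × H × W × C convention used above:

```python
import numpy as np

# Sketch of Eq. (6): global average pooling maps each H x W x C feature map
# to a C-dimensional vector by averaging over the spatial dimensions.
def global_average_pooling(o):
    # o: (N, H, W, C) batch of last-layer convolution outputs
    return o.mean(axis=(1, 2), keepdims=True)  # (N, 1, 1, C)
```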

The Three-Branch Shallow Downsampling Structure
Max-pooling downsampling is a nonlinear downsampling method. For small convolutional neural networks, better nonlinearity can be obtained by using max-pooling downsampling. In contrast, for deep neural networks, multi-layer stacked convolutional downsampling can learn better nonlinearity from the training set than max-pooling, as shown in Figure 2. Figure 2a,b represent convolutional downsampling and max-pooling downsampling, respectively. The convolutional downsampling in Figure 2a first applies a 3 × 3 convolution with stride 1 to the input data for feature extraction, and then a 3 × 3 convolution with stride 2 for downsampling. The max-pooling downsampling in Figure 2b first applies a 3 × 3 convolution with stride 1 to the input features, followed by max-pooling downsampling with stride 2. Combining max-pooling downsampling and convolutional downsampling, we propose the three-branch downsampling structure shown in Figure 3 for feature extraction, and use the input features to compensate the downsampled features, which not only extracts strong semantic features but also retains shallow information. In groups 1 and 2 of the network, we use the structure shown in Figure 3 to extract shallow features. The structure is divided into three branches. The first branch uses a 3 × 3 convolution with stride 2 to obtain f_down, and then a 3 × 3 convolution with stride 1 to extract the shallow features of the image and obtain f1(x). That is,

f_down = δ(BN(K2 * F))    (7)

f1(x) = δ(BN(K1 * f_down))    (8)

In Equations (7) and (8), δ represents the ReLU activation function, BN represents batch normalization, F represents the input features, Ks represents the 3 × 3 convolution kernel with stride s, and * represents the convolution operation. The second branch uses max-pooling with stride 2 to downsample the input features, yielding f_mij.
Max-pooling passes the most responsive part of the features to the next layer, which reduces redundant information in the network and makes the network easier to optimize. Max-pooling downsampling can also reduce the estimated mean shift caused by parameter errors of the convolution layer and retain more texture information. Then, the shallow features f2(x) are extracted by a 3 × 3 convolution with stride 1. That is,

f_mij = max_{(s,t) ∈ R_ij} x_mst    (9)

In Equation (9), f_mij represents the max-pooling output value in the rectangular region R_ij for the m-th feature, and x_mst represents the element at position (s, t) in the rectangular region R_ij.
The fused feature f(x) is obtained by fusing the features from branch 1 and branch 2. To reduce the loss of feature information caused by the first two branches, a residual branch is constructed to compensate for the lost information. The fused feature f(x) and the third branch are then fused to generate the final output feature y(x). That is,

f(x) = f1(x) + f2(x)    (10)

y(x) = f(x) + g(x)    (11)

The g(x) in Equation (11) is a residual connection implemented by a 1 × 1 convolution.
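The structure above can be sketched in PyTorch. This is a hedged sketch, not the authors' verified implementation: the channel widths and the stride-2 1 × 1 shortcut used to shape-match g(x) are our assumptions:

```python
import torch
import torch.nn as nn

class ThreeBranchDown(nn.Module):
    """Sketch of the three-branch downsampling structure (Eqs. (7)-(11)).
    Channel widths and the stride-2 shortcut are illustrative assumptions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # branch 1: stride-2 3x3 conv (Eq. (7)) then stride-1 3x3 conv (Eq. (8))
        self.b1_down = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.b1_conv = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, stride=1, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # branch 2: stride-2 max-pooling (Eq. (9)) then stride-1 3x3 conv
        self.b2_pool = nn.MaxPool2d(2, stride=2)
        self.b2_conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # branch 3: residual g(x), a 1x1 conv matched to the downsampled shape
        self.g = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        f1 = self.b1_conv(self.b1_down(x))   # branch 1
        f2 = self.b2_conv(self.b2_pool(x))   # branch 2
        f = f1 + f2                          # Eq. (10)
        return f + self.g(x)                 # Eq. (11)
```

For an even input size, all three branches halve the spatial resolution, so the element-wise additions in Eqs. (10) and (11) are shape-consistent.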

Channel Multi-Group Fusion Structure
The proposed channel multi-group fusion structure is shown in Figure 4. It divides the input features with C channels into two parts: one part is composed of four features with C/4 channels each, and the other part is composed of two features with C/2 channels each. First, convolution operations are performed on the features with C/4 channels, and the results of adjacent convolutions are concatenated along the channel dimension, so that each concatenated feature has C/2 channels. Then, convolution operations are performed on the features with C/2 channels, and the convolution results of adjacent features are again concatenated, so that each fused feature has C channels. Finally, convolution operations are performed on the features with C channels, and the convolution results are fused to obtain the output features. This process can be described as follows.

Suppose that the input feature is X ∈ R^(W×H×C), where x^i_{C/4} ∈ R^(W×H×C/4) represents the i-th feature with C/4 channels and x^i_{C/2} ∈ R^(W×H×C/2) represents the i-th feature with C/2 channels. After channel grouping, the features with C/4 channels can be represented as x^1_{C/4}, x^2_{C/4}, x^3_{C/4} and x^4_{C/4}; after convolution, the results are y^1_{C/4}, y^2_{C/4}, y^3_{C/4}, y^4_{C/4} ∈ R^(W×H×C/4), which can be represented as

y^i_{C/4} = ReLU(BN(f_conv(W, x^i_{C/4}))),  i = 1, 2, 3, 4    (12)

where y^i_{C/4} represents the convolution result of the feature x^i_{C/4}, y^i(m) represents the m-th channel of the i-th feature with C/4 channels, f_conv(·) represents the convolution operation, W represents the convolution weight, ReLU represents the activation function, and BN represents batch normalization.
Group convolution reduces the demand for computing power, but it also prevents information interaction between group features, which makes the extracted features incomplete. The information interaction is enhanced through channel concatenation of two adjacent features among y^1_{C/4}, . . . , y^4_{C/4}. Each feature after channel concatenation has C/2 channels, and t^i_{C/2} ∈ R^(W×H×C/2) represents the i-th concatenated feature; denoting the channel concatenation of features ς and τ by [ς, τ], t^i_{C/2} is calculated as

t^i_{C/2} = [y^(2i-1)_{C/4}, y^(2i)_{C/4}],  i = 1, 2    (13)

The features with C/2 channels are then processed by depthwise separable convolution:

y^i_{C/2} = ReLU(BN(f_dsc(W, ·))),  applied to each of x^1_{C/2}, x^2_{C/2}, t^1_{C/2} and t^2_{C/2}    (14)

where y^i_{C/2} represents the convolution result of the features x^i_{C/2} and t^i_{C/2}, y^i(m) represents the m-th channel of the i-th feature with C/2 channels, and f_dsc(·) represents the depthwise separable convolution operation. Then, the adjacent features among y^1_{C/2}, . . . , y^4_{C/2} are concatenated in the channel dimension. Each concatenated feature has C channels, and s^i_C ∈ R^(W×H×C) represents the i-th concatenated feature:

s^i_C = [y^i_{C/2}, y^(i+1)_{C/2}]    (15)

The features s^1_C, s^2_C, s^3_C and s^4_C with C channels are processed by depthwise separable convolution, respectively, giving the convolution results y^1_C, y^2_C, y^3_C and y^4_C:

y^i_C = ReLU(BN(f_dsc(W, s^i_C))),  i = 1, 2, 3, 4    (16)

Next, the features y^1_C, y^2_C, y^3_C and y^4_C are fused, and the fusion result and the input feature X are connected by a shortcut to obtain the final output Y ∈ R^(W×H×C):

Y = (y^1_C ⊕ y^2_C ⊕ y^3_C ⊕ y^4_C) + X    (17)

where ⊕ denotes feature fusion.
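The block can be sketched in PyTorch as follows. The pairing of "adjacent" features at each concatenation step is our reading of Figure 4 (a cyclic pairing is assumed at the second step, and feature fusion is taken as element-wise addition), so this is an illustrative sketch rather than the authors' exact wiring:

```python
import torch
import torch.nn as nn

def dsc(c_in, c_out):
    # depthwise separable 3x3 convolution: depthwise then 1x1 pointwise
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),
        nn.Conv2d(c_in, c_out, 1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class CMGF(nn.Module):
    """Sketch of the channel multi-group fusion block (Figure 4).
    The adjacent-pairing scheme is an assumption, not a verified reproduction."""
    def __init__(self, c):
        super().__init__()
        assert c % 4 == 0
        q, h = c // 4, c // 2
        # one 3x3 convolution per C/4-channel group
        self.conv_q = nn.ModuleList(
            nn.Sequential(nn.Conv2d(q, q, 3, padding=1), nn.BatchNorm2d(q),
                          nn.ReLU(inplace=True)) for _ in range(4))
        # depthwise separable convolutions on the C/2-channel features
        self.conv_h = nn.ModuleList(dsc(h, h) for _ in range(4))
        # depthwise separable convolutions on the C-channel features
        self.conv_c = nn.ModuleList(dsc(c, c) for _ in range(4))

    def forward(self, x):
        # view the input both as four C/4 groups and as two C/2 groups
        q1, q2, q3, q4 = torch.chunk(x, 4, dim=1)
        h1, h2 = torch.chunk(x, 2, dim=1)
        yq = [conv(g) for conv, g in zip(self.conv_q, (q1, q2, q3, q4))]
        # concatenate adjacent C/4 results -> two C/2 features (Eq. (13))
        t1 = torch.cat(yq[0:2], dim=1)
        t2 = torch.cat(yq[2:4], dim=1)
        # depthwise separable conv on all four C/2-channel features (Eq. (14))
        z = [conv(f) for conv, f in zip(self.conv_h, (h1, t1, t2, h2))]
        # concatenate adjacent C/2 results cyclically -> four C-channel features
        s = [torch.cat((z[i], z[(i + 1) % 4]), dim=1) for i in range(4)]
        y = [conv(f) for conv, f in zip(self.conv_c, s)]
        # fuse the four results and add the shortcut from the input (Eq. (17))
        return sum(y) + x
```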

Experiment
In this section, the proposed LCNN-CMGF method is evaluated from multiple perspectives using different indicators. The four most commonly used remote sensing scene datasets, UCM21, RSSCN7, AID and NWPU45, are used to carry out a variety of experiments. The experimental results on the four datasets under multiple training ratios show that the proposed LCNN-CMGF method has significant performance advantages over the compared state-of-the-art methods.

Dataset Settings
To verify the performance of the proposed LCNN-CMGF method, a series of experiments were performed on four datasets, i.e., UCM21 [21], RSSCN7 [22], AID [23] and NWPU45 [24]. In addition to their complex spatial structure, remote sensing scene images also exhibit high intra-class diversity and inter-class similarity, which make these four datasets very challenging. Details of the four datasets are shown in Table 1, including the number of images per class, the number of scene categories, the total number of images, the spatial resolution of the images, and the image size. In addition, we select one scene image from each scene category of the four datasets for display, as shown in Figure 5. Because the image sizes are inconsistent, and to avoid memory overflow during training, bilinear interpolation is used to resize the training images to 256 × 256.
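For illustration, the bilinear interpolation used for the 256 × 256 resizing can be written out directly. This is a pure-Python sketch using an align-corners mapping for simplicity; a practical pipeline would use a library resize:

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D grid (list of lists) with bilinear interpolation.
    Uses align-corners coordinate mapping; single-channel for brevity."""
    h, w = len(img), len(img[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        # map output row i to a fractional source row y
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y); y1 = min(y0 + 1, h - 1); dy = y - y0
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x); x1 = min(x0 + 1, w - 1); dx = x - x0
            # interpolate horizontally on the two bracketing rows, then vertically
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            out[i][j] = top * (1 - dy) + bot * dy
    return out
```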

Setting of the Experiments
When dividing the datasets, a stratified sampling method is adopted, which effectively avoids the risk of sampling bias. In stratified sampling, a random seed is set to ensure that the same images are chosen in each experiment. In addition, to improve the reliability of the experimental results, the average of 10 experimental runs is taken as the final result. Following previous work on remote sensing scene image classification, the datasets are divided as follows: the UCM21 [21] dataset is divided into training:test = 8:2, that is, 1680 scene images are used for training and the remaining 420 for testing; the RSSCN7 [22] dataset is divided into training:test = 5:5, that is, 1400 scene images are used for training and the remaining 1400 for testing; the AID [23] dataset is divided into training:test = 2:8 and training:test = 5:5, respectively, that is, either 2000 images are used for training and 8000 for testing, or 5000 for training and 5000 for testing; and the NWPU45 [24] dataset is divided into training:test = 1:9 and training:test = 2:8, respectively, that is, either 3150 images are used for training and 28,350 for testing, or 6300 for training and 25,200 for testing. Table 2 lists the input and output sizes of each group of features from group 1 to group 8 in the LCNN-CMGF method. Table 3 shows the experimental environment and parameter settings.
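The stratified, seeded split described above can be sketched as follows. This is a minimal illustration; the function name and signature are ours, not from the paper's code:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio, seed=42):
    """Per-class shuffled split with a fixed seed, so every run selects the
    same images. Returns (train_indices, test_indices)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)                       # deterministic given the seed
        k = int(round(train_ratio * len(idxs)))  # per-class training count
        train_idx += idxs[:k]
        test_idx += idxs[k:]
    return train_idx, test_idx
```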

Experimental Results
To verify the performance of the proposed method, evaluation indexes such as overall accuracy (OA), the kappa coefficient (kappa), the confusion matrix, and the number of weight parameters were used for experimental comparison.

Experimental Results on the UCM21 Dataset
As shown in Table 4, under the condition that the training proportion of the UCM21 dataset was 80%, the classification accuracy of the proposed LCNN-CMGF method reaches 99.52%, which exceeds all the comparison methods: it is 0.6% higher than Lie group (LiG) with sigmoid kernel [25], 0.55% higher than Contourlet CNN [26], and 3.19% higher than MobileNet [27]. Table 5 lists the kappa coefficients of the proposed LCNN-CMGF method and the comparison methods. The kappa coefficient of the proposed method is 99.50%, which is 6.13% higher than EfficientNet [28] and 2.58% higher than fine-tuned MobileNet V2 [29], proving the effectiveness of our method. Table 4. OA (%) of eighteen methods and the LCNN-CMGF method at the training ratio of 80% in the UCM21 dataset.
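The kappa coefficient reported in Tables 5, 8 and 10 is Cohen's kappa, computed from the confusion matrix. A textbook implementation (not the authors' script) is:

```python
# Cohen's kappa: agreement between predicted and true labels, corrected for
# the agreement expected by chance.
def cohen_kappa(cm):
    """cm: square confusion matrix as a list of lists, rows = true classes."""
    n = float(sum(sum(row) for row in cm))
    k = len(cm)
    p_o = sum(cm[i][i] for i in range(k)) / n            # observed agreement
    p_e = sum(sum(cm[i]) * sum(row[i] for row in cm)     # chance agreement
              for i in range(k)) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)
```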

Experimental Results on the RSSCN7 Dataset
A comparison of the experimental results of the proposed method and some state-of-the-art methods proposed in the last two years on the RSSCN7 dataset is shown in Table 6. The OA of our proposed method is 97.50%, which is 0.85%, 3.9% and 1.52% higher than that of VGG-16-CapsNet [30], WSPM-CRC [31] and the Positional Context Aggregation method [32], respectively. This proves that our method has better feature representation ability. Table 6. OA (%) of seven methods and the LCNN-CMGF method at the training ratio of 50% in the RSSCN7 dataset.

Experimental Results on the AID Dataset
Some excellent CNN-based methods on the AID dataset from 2018 to 2020 are selected for comparison with the proposed method. The experimental results are shown in Table 7. Under the condition that the training proportion of the AID dataset was 20%, the classification accuracy of the proposed LCNN-CMGF method is 93.63%, which is 1.57% higher than LCNN-BFF [33], 2.07% higher than the DDRL-AM method [34], 2.53% higher than Skip-Connected CNN [35], and 1.43% higher than GB-Net + Global Feature [36]. Under the condition that the training proportion was 50%, the OA of the proposed method reaches 97.54%, which is 2.09% higher than Feature Aggregation CNN [37], 2.28% higher than Aggregated Deep Fisher Feature [38], 0.79% higher than HABFNet [39], 2.15% higher than EfficientNetB3-Attn-2 [40], and 1.56% higher than the VGG_VD16 with SAFF method [41]. The experimental results show that the proposed method is very effective: for remote sensing scene images with rich image variation, high inter-class similarity and strong intra-class differences, it can capture more representative features. As shown in Table 8, the kappa coefficient of the proposed method is 97.45% when the training proportion is 50%, which is 1.95% higher than Semi-Supervised Representation Learning [42], 1.33% higher than Variable-Weighted Multi-Fusion [43], 3.49% higher than TSDFF [44], and 3.19% higher than Discriminative+AlexNet [45]. The kappa coefficient results demonstrate that the predicted and actual results of the proposed method are more consistent. Table 8. Kappa (%) of fourteen methods and the LCNN-CMGF method at the training ratio of 50% in the AID dataset.

Experimental Results on the NWPU45 Dataset
Similar to the AID dataset, some excellent neural networks evaluated on the NWPU45 dataset from 2018 to 2020 are selected for experimental comparison. The experimental results are shown in Table 9. When training:test = 1:9, the OA of the proposed method reaches 92.53%, which is 6% higher than the LCNN-BFF method [33], 8.15% higher than VGG_VD16 with the SAFF method [41], 3.31% higher than Discriminative + VGG16 [45], 11.19% higher than VGG19 [46] and 0.97% higher than MSDFF [50]. When training:test = 2:8, the OA of the proposed method is 6.6% and 2.45% higher than that of Contourlet CNN [26] and the LCNN-BFF method [33], respectively; meanwhile, it is 8.2% higher than Skip-Connected CNN [35] and 3.31% higher than Discriminative + VGG16 [45]. This indicates that the proposed method performs better at both training ratios on the NWPU45 dataset. Under the condition that the training proportion is 20% in the NWPU45 dataset, the kappa coefficient comparison between the proposed LCNN-CMGF method and the comparison methods is shown in Table 10. The kappa value of the proposed method is the highest among all comparison methods, reaching 94.04%; it is 1.12%, 5.69% and 10.72% higher than LiG with sigmoid kernel [25], Contourlet CNN [26] and MobileNet [27], respectively.
On the NWPU45 dataset, when training:test = 2:8, the confusion matrix obtained by the proposed LCNN-CMGF method is shown in Figure 9. Because the NWPU45 dataset contains rich content variations and a complex spatial structure, no scene class is recognized perfectly. However, the classification accuracy of the proposed method exceeds 90% for 43 of the scene classes. The lowest classification accuracies were for 'palace' and 'church', at 87% and 88%, respectively. The main reason is that these two scene classes have similar architectural styles and are easily confused during feature extraction, resulting in classification errors. Table 9. OA (%) of seventeen methods and the LCNN-CMGF method at the training ratios of 10% and 20% in the NWPU45 dataset.
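Per-class accuracies such as the 87% for 'palace' are read off the rows of the confusion matrix. A minimal sketch with illustrative numbers (not the actual NWPU45 matrix):

```python
# Per-class accuracy: diagonal entry divided by its row sum, where rows of the
# confusion matrix are the true classes.
def per_class_accuracy(cm):
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
```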

Comparison of the Computational Complexity of Models
In addition, to further demonstrate the speed advantages of the proposed method, MobileNetV2 [12], CaffeNet [23], VGG-VD-16 [23], GoogLeNet [23], Contourlet CNN [26], SE-MDPMNet [29], Inception V3 [46], ResNet50 [46], LiG with RBF kernel [51], and LGRIN [53] were compared with the proposed LCNN-CMGF method. The experiments were carried out on the AID dataset. The number of Giga Multiply-Accumulate operations (GMACs), which measures the computational complexity of a model, was used as an evaluation index. The comparison of these methods on the AID dataset with training:test = 5:5 is shown in Table 11. As shown in Table 11, the OA of the proposed LCNN-CMGF method is 97.54%, the parameter quantity is 0.8 M, and the GMACs value is 0.0160 G. Compared with the other lightweight models, LiG with RBF kernel [51] and MobileNetV2 [12], the proposed method achieves higher classification accuracy with less than half of their parameters. Although its accuracy is slightly lower than that of LGRIN [53], its parameter count is 3.83 M lower and its GMACs value is 0.4773 G lower. The proposed LCNN-CMGF method thus achieves a good trade-off between model complexity and classification accuracy. Table 11. Evaluation values of ten methods and the LCNN-CMGF method at the training ratio of 50% in the AID dataset.
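For reference, the MAC counts of individual layers, which sum to the GMACs figures in Table 11, can be computed as follows (our simplified accounting, ignoring bias and batch-normalization terms):

```python
# MACs of one convolution layer: one multiply-accumulate per output element
# per kernel weight it touches.
def conv_macs(h_out, w_out, k, c_in, c_out, groups=1):
    return h_out * w_out * k * k * (c_in // groups) * c_out

# depthwise separable = depthwise (groups = c_in) + 1x1 pointwise
def dsc_macs(h_out, w_out, k, c_in, c_out):
    return (conv_macs(h_out, w_out, k, c_in, c_in, groups=c_in)
            + conv_macs(h_out, w_out, 1, c_in, c_out))
```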

Comparison Results of Shallow Feature Extraction Modules
A convolutional neural network first extracts the shallow features of an image. As the network deepens, the extracted features become more abstract and contain more semantic content, as shown in Figure 10. Figure 10a is the original remote sensing scene image. After feature extraction by the convolutional neural network, the shallow feature map is shown in Figure 10b. As the network deepens, more complex features are extracted, as shown in Figure 10c. Compared with the features in Figure 10b, the features in Figure 10c are more complex and carry more semantic information. For the classification of remote sensing images, both shallow and deep features are very useful. The traditional methods for extracting shallow image features are shown in Figure 2a,b. These two methods are not sufficient to extract the shallow features of the image, and some information is lost during the feature extraction process. Therefore, we propose a three-branch downsampling structure to extract the shallow features of images, which has clear performance advantages compared with the traditional convolution methods. To prove the effectiveness of the proposed three-branch downsampling structure, the feature extraction capabilities of the two traditional downsampling structures (Figure 2a,b) and the proposed three-branch downsampling structure (Figure 3) are experimentally compared. The comparison process is as follows. In the first experiment, the downsampling structure in Figure 2a replaces the three-branch downsampling structure in group 1 and group 2 of the proposed method, denoted method 1. In the second experiment, the downsampling structure in Figure 2b replaces the three-branch downsampling structure in group 1 and group 2 of the proposed method, denoted method 2.
In the third experiment, the three-branch downsampling structure is retained, denoted method 3. The specific experimental parameter settings are described in Section 2.2. The comparative experiments were conducted on the AID dataset with training:test = 5:5; for a fair comparison, the three experiments were carried out under the same experimental conditions. The experimental results are listed in Table 12. As shown in Table 12, the three methods show no obvious difference in parameter quantity and model complexity. However, the other two methods have lower classification accuracy than method 3. Specifically, the classification accuracy of method 3 is 1.08% and 1.57% higher than that of method 1 and method 2, respectively.

Ablation Experiment
In this section, the effect of the number of channels in each group of the channel multi-group fusion structure on network performance is analyzed. Firstly, the groups with C/2 channels are removed from the designed multi-group fusion structure, yielding the structure with only C/4-channel groups shown in Figure 11a. Secondly, the groups with C/4 channels are removed, yielding the structure with only C/2-channel groups shown in Figure 11b. Finally, the complete multi-group fusion structure is used for comparison. The comparative experiments were conducted on the AID dataset with a training proportion of 50%, with OA, parameters and GMACs as evaluation indexes. To make a fair comparison, the experimental equipment and parameter settings were identical across the three experiments. The experimental results are listed in Table 13. As shown in Table 13, with the grouping structure with C/4 channels (Figure 11a), the OA is 96.09%, the parameter quantity is 0.57 M, and the GMACs value is 0.0153 G. With the grouping structure with C/2 channels (Figure 11b), the OA is 95.20%, the parameter quantity is 0.53 M, and the GMACs value is 0.0149 G. The two structures are similar in parameter quantity and model complexity, but the OA of the C/4 grouping structure is higher than that of the C/2 grouping structure because it increases the diversity of features. However, both still fall well short of the proposed multi-group fusion structure in classification performance, which further proves the effectiveness of the proposed method.

Discussions
To display the feature extraction ability of the proposed method more intuitively, a series of visualization methods are adopted to evaluate it. Firstly, the feature extraction ability of the proposed method is presented using Class Activation Map (CAM) visualization. The CAM method displays the image regions important to the model's prediction by generating a coarse attention map from the last convolutional layer of the network. Some images from the UCM21 dataset are chosen for the visualization experiments, and the results are shown in Figure 12. As shown in Figure 12, the proposed LCNN-CMGF method can highlight the semantic objects corresponding to the true categories, indicating a strong ability to locate and recognize objects. In addition, the proposed method covers the semantic objects well, with a wide highlighted range. Next, t-distributed stochastic neighbor embedding (t-SNE) is used to further evaluate the performance of the proposed LCNN-CMGF method. t-SNE is a nonlinear dimensionality reduction algorithm that maps high-dimensional features to a two- or three-dimensional space for visualization, which makes it well suited to assessing the classification behavior of a model. The RSSCN7 and UCM21 datasets are used for the visualization experiments, and the results are shown in Figure 13.
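A minimal CAM computation can be sketched as follows; the array shapes and the min-max normalization are our assumptions, as CAM implementations differ in such details:

```python
import numpy as np

# CAM sketch: the class activation map is the FC-weighted sum of the last
# convolutional layer's channel maps, rescaled to [0, 1] for display.
def class_activation_map(features, fc_weights, cls):
    """features: (C, H, W) last-conv output; fc_weights: (num_classes, C)."""
    cam = np.tensordot(fc_weights[cls], features, axes=1)  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```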
As shown in Figure 13, there is no confusion between the individual semantic clusters on the UCM21 and RSSCN7 datasets. This means that the proposed LCNN-CMGF method has better global feature representation and increases the separability and relative distance between semantic clusters, so it can extract the features of remote sensing scene images more accurately and improve classification accuracy.
In addition, a randomized prediction experiment is performed on the UCM21 dataset using the LCNN-CMGF method. The results are shown in Figure 14. From Figure 14, we can see that the LCNN-CMGF method has more than 99% confidence in remote sensing image prediction, and some of the predictions even reach 100%. This further proves the validity of the proposed method for remote sensing scene image classification.

Conclusions
In this paper, a lightweight convolutional neural network based on channel multi-group fusion (LCNN-CMGF) was proposed to classify remote sensing scene images. In the proposed LCNN-CMGF method, a three-branch downsampling structure was designed to extract shallow features from remote sensing images, and a channel multi-group fusion structure was presented to efficiently extract deep, abstract features. The channel multi-group fusion structure utilizes channel fusion of adjacent features to reduce the lack of information exchange between groups caused by group convolution. The experimental results show that the proposed LCNN-CMGF method achieves higher classification accuracy with fewer parameters and lower computational complexity than some state-of-the-art methods. In particular, on the UCM21 dataset, the OA of this method reaches 99.52%, surpassing most existing advanced methods. Future work will explore more effective convolution methods to reduce the loss of feature information as much as possible while preserving the lightweight nature of the network.