A Convolutional Neural Network Based on Grouping Structure for Scene Classification

Convolutional neural networks (CNNs) are capable of automatically extracting image features and have been widely used in remote sensing image classification. Feature extraction is an important and difficult problem in current research. In this paper, data augmentation was applied to avoid overfitting and to enrich sample features, improving the performance of a newly proposed convolutional neural network on the UC-Merced and RSI-CB datasets for remote sensing scene classification. A multiple grouped convolutional neural network (MGCNN) for self-learning, capable of improving the efficiency of CNNs, was proposed, and a method of grouping multiple convolutional layers that can be applied elsewhere as a plug-in model was developed. Meanwhile, a hyper-parameter C was introduced into MGCNN to probe the influence of different grouping strategies on feature extraction. Experiments on the two selected datasets, RSI-CB and UC-Merced, were carried out to verify the effectiveness of the newly proposed network; the accuracy obtained by MGCNN was 2% higher than that of ResNet-50. An attention mechanism was then incorporated into the grouping process, and a multiple grouped attention convolutional neural network (MGCNN-A) was constructed to enhance the generalization capability of MGCNN. Additional experiments indicate that incorporating the attention mechanism into MGCNN only slightly improved scene classification accuracy, but considerably enhanced the robustness of the proposed network in remote sensing image classification.


Introduction
With the rapid advance of remote sensing and earth observation technology, high spatial resolution (HSR) remote sensing (RS) imagery [1,2] with sub-meter spatial resolution, and even very high resolution (VHR) RS imagery [3,4] with centimeter-level resolution, have become widely available and easily accessible to the public. With the growing amount of data, there is a practical need for faster and more accurate automated approaches to extract semantic content and to identify and classify land use and land cover (LULC) types in these images. RS image scene classification [5][6][7] is one crucial way to alleviate this problem, since it automatically assigns semantic labels to an RS image scene; it has been widely studied due to its vital contributions to land resources planning [8], disaster monitoring [9], urban planning [10], object detection [11], and many other RS applications [12][13][14][15].
Effective feature extraction is one of the key steps in image classification. Traditional machine learning methods need to design features manually and then transform them into descriptive vectors, such as Scale-Invariant Feature Transform (SIFT) features [16]. Combined with clustering methods like K-Means, these features are mapped into a visual dictionary that generates a feature histogram for each image with the bag of visual words (BOVW) model [17]. However, this approach relies heavily on handcrafted features, and the clustering step also requires expert experience and knowledge. In recent years, convolutional neural networks have achieved remarkable progress in natural image classification. AlexNet [18] used a large number of convolution kernels for feature extraction, while VGGNet [19] further increased the width and depth of the network and enlarged the model volume. GoogLeNet [20] used convolution kernels of different sizes to construct the inception structure, which extracts multi-scale features, and replaced the fully connected layer with a global pooling layer, reducing the amount of computation and improving network performance. ResNet [21], committed to solving the problem of vanishing gradients in very deep networks, used the residual structure to address model degradation. The newly emerged attention mechanism has also promoted the development of deep learning; it learns new features based on the input features, thereby improving network performance. The following networks all adopted and improved upon the advantages of their predecessors: DenseNet [22] integrated the features of earlier layers; SENet [23] defined channel weight relationships, strengthening useful information and suppressing useless information; SKNet [24] used multi-scale convolution and can adaptively adjust the weights of the convolutional feature maps.
ResNeXt [25] was among the first ones that attempted to use multiple groups of convolutions for feature extraction.
Compared with natural images, remote sensing image scenes are more complex: a single scene is usually mixed with many different kinds of objects [26][27][28]. Due to inconsistencies in spatial resolution, the spatial scales of the objects are not the same, and some ground objects may show significant similarities in their spectra [29,30]. Therefore, methods that ensure the model extracts effective features are both the focus and the difficulty of remote sensing image scene classification. At present, feature extraction in remote sensing scene classification research is mainly developed on the basis of CNN models. Han et al. [31] improved a pre-trained AlexNet with spatial pyramid pooling (SPP) used for feature fusion. Gong et al. [32] introduced an anti-noise transfer network based on a pre-trained VGGNet. Li et al. [33], inspired by the inception structure of GoogLeNet, designed a multi-scale feature extraction method to handle objects whose size varies considerably within the same category. The attention mechanism, in turn, was designed to change the weights of feature maps to improve network performance [34,35]; since then, spatial and channel attention mechanisms have been applied to feature extraction [36][37][38][39]. However, these methods still have some disadvantages. On the one hand, multi-scale features can be obtained by stacking several small convolutional kernels. On the other hand, these models do not yet take the network's internal feature extraction process into account. Therefore, it is necessary to understand the details of feature extraction performed by the tremendous number of convolutional kernels.
To discern the internal feature extraction process of the model, we propose the MGCNN model, which embeds group convolution blocks in each convolution layer and uses ResNet-50 [21] as the backbone network structure. The grouping process divides the input into different groups, performs convolution separately within each group, and then combines the groups' convolutional results to improve the performance of the model in scene classification. In the group convolution blocks, we introduced a hyper-parameter C to control the number of groups and paths. The number of paths, which affected the accuracy of the model across several experiments, was treated as a hyper-parameter. To further explore the performance of the attention mechanism in remote sensing scene classification, we introduced an attention structure into MGCNN, forming a variant called MGCNN-A. This structure can automatically train the weights of the feature maps based on grouping convolution. In short, the major scientific contributions of this study are summarized below:
• A convolutional neural network framework, namely MGCNN, was proposed based on a group convolution scheme, introducing a hyper-parameter C to divide the feature extraction path into multiple channels, improving the efficiency of feature extraction while enriching the feature space.
• The attention mechanism and group convolution scheme were explored and incorporated into the proposed MGCNN, and a modified MGCNN, namely MGCNN-A, was developed. The influence of incorporating grouping and the attention mechanism on the performance of MGCNN-A, as well as the effect of the hyper-parameter C under fixed feature map channel numbers, was comprehensively investigated. In addition, the features extracted by MGCNN and MGCNN-A are compared in the discussion.
The rest of this paper is organized as follows. In Section 2, we introduce the proposed MGCNN and MGCNN-A in detail. Experiments and results with our proposed models on two datasets are given in Section 3. In Section 4, discussions about the proposed model are presented, followed by the conclusion and future work which are discussed in Section 5 at the end.

Framework of Model
As shown in Table 1, ResNet-50 [21] was adopted as the backbone architecture to develop our proposed models MGCNN and MGCNN-A. In the original ResNet-50 [21], the number of convolution kernels in each layer is 64, 64, 128, 256, and 512, respectively. As shown in the third column of Table 1, we reduced the number of convolution kernels to avoid overfitting. In our proposed models, grouped convolution and grouped attention blocks were embedded into each convolutional layer of MGCNN and MGCNN-A to enrich the extracted features. Finally, we used global average pooling to replace the fully connected (FC) layers to reduce the number of parameters. The parameter C indicates that the input tensor is divided into C groups, while A indicates that the attention structure is added to each group. Figure 1 illustrates the size of the output tensor after convolution in each layer. In the figure, k is the convolution kernel size, s is the stride, and repeat is the number of grouped attention blocks. The last four convolution layers are composed of several convolution blocks (blue blocks), in which the grouped convolution block and grouped attention block are used.

Grouped Convolution Block
Grouped convolution was first used in AlexNet [18], which utilized two GPUs for training the model. According to our experiments, multiple paths are favorable for extracting features efficiently. As shown in Figure 2, we added a hyper-parameter C, representing the number of groups, to divide the input tensor into several groups. In each group, we used a 1 × 1 kernel followed by a 3 × 3 convolution kernel to transform the feature maps. After the convolution layers, the ReLU activation function was applied. Afterwards, a concatenation function was used to combine the outputs from each path. Finally, we added the output of a 1 × 1 convolution layer to the input to construct the short-cut structure of ResNet [21].
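The split–transform–concatenate bookkeeping described above can be sketched in a few lines of NumPy. This is illustrative only: the per-group 1 × 1 and 3 × 3 convolutions (and the final short-cut convolution) are abstracted into a generic `transform` callable, with a ReLU stand-in used in the toy example.

```python
import numpy as np

def grouped_transform(x, C, transform):
    """Split the channel axis of x (H, W, ch) into C groups, apply
    `transform` to each group independently, then concatenate the
    per-group outputs -- the core idea of the grouped convolution block."""
    groups = np.split(x, C, axis=-1)           # C tensors of shape (H, W, ch/C)
    outputs = [transform(g) for g in groups]   # stand-in for 1x1 then 3x3 conv + ReLU
    return np.concatenate(outputs, axis=-1)    # recombine along the channel axis

# toy example: a ReLU stand-in just to show the shape bookkeeping
x = np.random.rand(8, 8, 64)
y = grouped_transform(x, C=4, transform=lambda g: np.maximum(g, 0))
assert y.shape == (8, 8, 64)
```

Because each path only sees ch/C input channels, the parameter count of the grouped layer shrinks roughly by a factor of C relative to a full convolution over all channels, which is what makes the multi-path design efficient.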

Channel Attention
In typical convolutional neural networks, the weight of each feature channel is fixed, so the network usually can only discriminate prominent features. The channel attention structure can automatically train different weights according to different feature maps. As shown in Figure 3, two fully connected layers were designed to learn the weights of the neurons, while convolution layers were used to obtain feature information. As a final step, the channel weights produced by the Sigmoid function were multiplied with the input feature maps to obtain reinforced attention maps. The attention module can be expressed as follows:

F = W_2(A_R(W_1(X))), Y = F ⊗ A_S(f_2(A_R(f_1(F))))

where A_S and A_R denote the Sigmoid and ReLU activation functions, and W_1,2 and f_1,2 refer to the two convolution layers and the two fully connected layers, respectively. As can be seen from the formula, the input X is transformed into a feature map F by the convolution layers. The fully connected layers synthesize the feature map and are activated by ReLU and Sigmoid, which amplifies the high-frequency signal. Multiplying the output of the fully connected layers with the corresponding channels magnifies the more prominent features.
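A minimal NumPy sketch of such a channel attention module follows. The convolution layers are omitted (the sketch starts from the feature map F), global average pooling is assumed as the channel descriptor feeding the two fully connected layers, and the matrices `f1` and `f2` stand in for the trained FC weights; all of these are illustrative assumptions, not the trained implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, f1, f2):
    """SE-style channel attention on a (H, W, ch) feature map:
    pool each channel to a scalar, pass through FC (ReLU) and
    FC (Sigmoid), then rescale the channels by the learned weights."""
    pooled = feat.mean(axis=(0, 1))        # (ch,) channel descriptor (assumed GAP)
    hidden = np.maximum(pooled @ f1, 0)    # A_R: ReLU after the first FC layer
    weights = sigmoid(hidden @ f2)         # A_S: per-channel weights in (0, 1)
    return feat * weights                  # reweight each channel of the input

rng = np.random.default_rng(0)
feat = rng.random((8, 8, 16))
f1 = rng.standard_normal((16, 4))   # squeeze: ch -> ch/4
f2 = rng.standard_normal((4, 16))   # excite: ch/4 -> ch
out = channel_attention(feat, f1, f2)
assert out.shape == feat.shape
```

Since the Sigmoid keeps every weight strictly between 0 and 1, the module can only attenuate channels relative to the input; training pushes the weights of informative channels toward 1 and uninformative ones toward 0.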

Grouped Attention Block
Although the attention structure is capable of automatically training the weight of each attention channel, it is challenging to enrich the feature map space by relying on it alone. Thus, grouped attention blocks with the grouping parameter C were introduced into the attention structure, as displayed in Figure 4. The parameter C added here divides each convolution layer into C paths. In each attention group, as in the grouped convolution block, one 1 × 1 kernel followed by one 3 × 3 convolution kernel was used. After grouped convolution, the grouped feature maps were concatenated and fed into a fully connected layer whose output weights the grouped feature maps. Finally, we adopted the shortcut structure of ResNet [21] and added the convolution result to the input layer.
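Combining the two previous sketches, the grouped attention block can be outlined schematically as follows. As before, the per-group convolutions are abstracted to a ReLU stand-in, and the attention weighting is reduced to a single hypothetical matrix `fc` applied to pooled channel descriptors; the sketch only illustrates the data flow (split, transform, concatenate, weight, shortcut), not the trained block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grouped_attention(x, C, fc):
    """Sketch of the grouped attention block: split channels into C groups,
    transform each group, concatenate, weight the concatenated map with a
    fully connected attention layer, then add a ResNet-style shortcut."""
    groups = np.split(x, C, axis=-1)
    feat = np.concatenate([np.maximum(g, 0) for g in groups], axis=-1)  # conv stand-in
    weights = sigmoid(feat.mean(axis=(0, 1)) @ fc)   # per-channel attention weights
    return x + feat * weights                        # shortcut: add back the input

rng = np.random.default_rng(1)
x = rng.random((8, 8, 32))
out = grouped_attention(x, C=4, fc=rng.standard_normal((32, 32)))
assert out.shape == x.shape
```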

Data Augmentation and Cross Validation
The cross-validation method, as illustrated in Figure 5, was adopted to prove the validity of our model. The datasets were randomly divided into four groups for cross-validation: three for training and one for validation. In other words, we trained each model four times and recorded the average accuracy. Cross-validation can effectively avoid abnormally high or low accuracy on specific dataset splits. This method was implemented on the RSI-CB [40] and UC Merced Land Use [41] datasets.

The number of images varies from 198 to 1331 within each category of the RSI-CB dataset. Such an imbalance in data volume between categories leads the model to classify images from low-volume categories as belonging to high-volume categories, because doing so reduces the training loss. Nevertheless, this would negatively affect the performance of our proposed model; therefore, several augmentation algorithms [42], including cropping, rotation, and flipping, were used to balance the volumes of the categories. In a preliminary experiment, we observed a severe overfitting problem when three-fourths of the UC Merced Land Use dataset was used for training, which indicated that such small training data could hardly reflect the actual distribution of the categories. Data augmentation can reduce the amount of random noise that a neural network easily learns from a small dataset.
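The balancing augmentations can be sketched as random draws from a small pool of operations. The particular crop margin and rotation angles below are illustrative assumptions; the paper only specifies that cropping, rotation, and flipping (among others) were used.

```python
import numpy as np

def augment(img, rng):
    """Draw one augmentation (flip / rotate / crop) at random -- a minimal
    pool of the kind used to balance category volumes and curb overfitting."""
    ops = [
        lambda a: np.fliplr(a),                     # horizontal flip
        lambda a: np.flipud(a),                     # vertical flip
        lambda a: np.rot90(a, k=rng.integers(1, 4)),  # 90/180/270 degree rotation
        lambda a: a[8:-8, 8:-8],                    # crop (resized back in practice)
    ]
    return ops[rng.integers(len(ops))](img)

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
aug = augment(img, rng)
assert aug.ndim == 3 and aug.shape[2] == 3
```

In practice, low-volume categories would be oversampled with such transforms until every category approaches the size of the largest one.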

Overall Accuracy and Confusion Matrix
The overall accuracy (OA) is an index that measures the proportion of correctly predicted individuals in the whole test dataset, and it reflects the quality of the model well. In the confusion matrix, each row represents the actual category and each column represents the predicted category, so the matrix directly reflects the misclassifications and omissions of each category. OA can be calculated as follows:

OA = (1/T) Σ_{i=1}^{k} Σ_{j=1}^{n} P_ij

where P_ij denotes a correctly predicted individual j of category i, n and k represent the number of individuals in each category and the total number of categories, respectively, and T is the total size of the test dataset.
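Equivalently, OA is the trace of the confusion matrix divided by the number of test samples, which can be sketched as:

```python
import numpy as np

def overall_accuracy(y_true, y_pred, k):
    """Build a k x k confusion matrix (rows = actual category,
    columns = predicted category) and compute
    OA = sum of the diagonal / total number of test samples."""
    cm = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return np.trace(cm) / len(y_true), cm

# toy example: 4 of 5 samples correct -> OA = 0.8
oa, cm = overall_accuracy([0, 0, 1, 2, 2], [0, 1, 1, 2, 2], k=3)
assert abs(oa - 0.8) < 1e-9
```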

Datasets
To evaluate the performance of the proposed model, the RSI-CB and UC-Merced datasets were used as benchmarks for model training. The introduced hyper-parameter C was tuned on the RSI-CB dataset, while the effectiveness and performance of our proposed networks were tested on a smaller dataset, i.e., the UC-Merced dataset. The RSI-CB dataset contains 35 categories with 24,747 images in total. The images are not evenly distributed among the 35 categories, ranging from 198 images in the smallest category to 1331 in the largest. Each image in the dataset has a spatial resolution of 0.3-3 m and a dimension of 256 × 256 pixels. Sample images of each category in this dataset are shown in Figure 6. The UC-Merced Land Use dataset is widely used as a benchmark for evaluating deep learning models on remote sensing scene classification tasks. It consists of 21 categories with 100 images per category. Each image has a 0.3 m spatial resolution and a dimension of 256 × 256 pixels. Figure 7 exhibits sample images of each category in this dataset.

Experimental Setup
The experiments were implemented under the TensorFlow framework on an NVIDIA GeForce RTX 2080Ti GPU. Data augmentation algorithms were applied to all images, and all images were cropped to 256 × 256 pixels for model input. We used a gradient descent optimizer with a decaying learning rate: the initial learning rate was 0.1, decayed exponentially by a factor of 0.96 every 300 iterations, and the batch size was 32. The maximum number of iterations was set to 40,000. Table 2 lists the overall accuracy (OA) achieved with and without data augmentation on three different base networks. Data augmentation had a more significant effect on VGGNet-16 (8% increase) than on the other two networks (about 2% increase each). Moreover, ResNet-50 achieved the highest OA (94.930%), about 1.139% higher than the second-ranked network, GoogLeNet-22 (93.791%). Although VGGNet-16 benefited the most from data augmentation, its OA was significantly lower than those of the other two networks. Figure 8 exhibits the confusion matrix (CM) of ResNet-50, ignoring accuracies below 0.001. Evidently, the model was not able to distinguish bare land (class 5) from desert (class 12), and both obtained lower accuracy than the other categories. Slight confusion also existed between other categories, since OA is the combined average accuracy over all categories.
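Assuming a staircase schedule (the text specifies the rate and interval but not the exact decay form), the learning rate used above can be reproduced as:

```python
def learning_rate(step, base_lr=0.1, decay_rate=0.96, decay_steps=300):
    """Staircase exponential decay matching the setup described above:
    lr = base_lr * decay_rate ** (step // decay_steps)."""
    return base_lr * decay_rate ** (step // decay_steps)

assert learning_rate(0) == 0.1                      # initial learning rate
assert abs(learning_rate(300) - 0.096) < 1e-12      # one decay step applied
assert learning_rate(40000) < learning_rate(0)      # decayed by the final iteration
```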

MGCNN Experiment
Hyper-parameter C is the core parameter of the MGCNN model. It can be observed from Table 3 that grouping improves performance: the OA of the MGCNNs increased by about 2% compared with ResNet-50. Specifically, the highest OA was obtained by MGCNN-C4 (96.881%), slightly higher than that of MGCNN-C2 (96.859%). The OA obtained by MGCNN-C8 and MGCNN-C16 suggests that embedding too many groups in the neural network does not lead to better performance. The CM of MGCNN shown in Figure 9 indicates that the accuracies obtained for bare land (class 5) and river (class 25) are lower than those of the other categories.

MGCNN-A Experiment
To further explore the performance of grouping, we added the attention structure to this new model. The OA of MGCNN-A with different values of the hyper-parameter C is shown in Table 4. The best-performing combination, MGCNN-A4, obtained only a 1.36% performance gain compared to ResNet-50. The decline in OA can probably be attributed to the fact that as the number of groups increases, the depth of the feature map extracted by each group becomes smaller. In general, the attention structure did not seem to work well in MGCNN-A. Figure 10 presents the CM of MGCNN-A4 obtained through the experiments described previously.

Experiment on UC-Merced Dataset Data Augmentation Comparative Experiment
As shown in Table 5, data augmentation can effectively improve the classification accuracy of all three models. The OA of GoogLeNet-22 and ResNet-50 increased by about 7%, while that of VGGNet-16 increased by only 3%. ResNet-50 performed the best among the three models. As can be observed in Figure 11, the agricultural (class 1), beach (class 4), chaparral (class 6), forest (class 8), harbor (class 11), mobile home park (class 14), and river (class 16) scenes are classified with almost 100% accuracy. The other scenes are classified with about 85% accuracy, except for intersection (class 12) and tennis court (class 21).

MGCNN Experiment
We also tested our model on the UC-Merced dataset. Table 6 lists the OA achieved by the MGCNNs in this experiment, in which an accuracy increase of about 2% can be observed after grouping. MGCNN-C4 achieved a higher OA than the other groupings; too many groups may reduce the model's efficiency, as also demonstrated by MGCNN-C16. It can be observed from Figure 12 that MGCNN-C4 achieved more than 95% accuracy in classifying agricultural (class 1), airplane (class 2), and six other scenes. Meanwhile, the classification errors for buildings (class 5), dense residential (class 7), intersection (class 12), and other categories were reduced by around 5% compared to ResNet-50 after grouping.

MGCNN-A Experiment
We investigated the performance of MGCNN-A on the UC-Merced dataset as well; the OA achieved in this experiment is listed in Table 7. It can be seen from Table 7 that all MGCNN-A models outperformed ResNet-50, and that the MGCNN-A models benefited from grouping with an OA increase of about 2% in general. Different grouping methods in MGCNN-A improved model performance by around 1.5% in OA, among which MGCNN-A4 gained the most. It is worth mentioning that the attention structure had less impact on OA than grouping. The CM of MGCNN-A4, as shown in Figure 13, reveals that some scenes are mixed up by MGCNN-A4 in classification, such as buildings (class 5), container (class 9), medium residential (class 13), and tennis court (class 21).

Generalization Capability
Through the above experiments, we found that grouped convolution could effectively improve classification accuracy, while the attention mechanism in MGCNN-A did not seem to affect classification accuracy on the two datasets. We therefore tested the proposed models across the two datasets. Airplane and parking lot are the two categories defined identically in the RSI-CB and UC-Merced datasets. We trained our models on the RSI-CB dataset and then validated them on the UC-Merced dataset, focusing the classification on these two categories, to test the generalization capability of our models. Both MGCNN and MGCNN-A outperformed ResNet-50 in this experiment; in contrast to the previous experiments, MGCNN-A4 performed better than MGCNN-C4 and exhibited stronger robustness when the models were transferred to a different dataset for validation, most probably because the attention mechanism of MGCNN-A enhances locally prominent features for classification. Table 8 shows the OA of the three models on the airplane and parking lot categories. Figure 14 presents the image scenes that all models failed to identify. Clearly, smaller objects are more challenging for all three DL models to recognize, because the pooling layers tend to ignore details.

Feature Extraction
To better understand the performance of our model in feature extraction, we visualized the feature layers of the models. As shown in Figure 15a, ResNet-50 extracted some repetitive features; for example, the last two feature maps are very similar. On the contrary, the four groups of feature maps extracted by MGCNN-C4 are richer in subtle information, distinguishing the background and scene features of the image (such as aircraft and houses) well. In MGCNN-A4, the attention structure enhanced the features that are more apparent in the image; for example, the extracted features were clearer for objects with easily recognizable edges. To better understand why the models' accuracy decreased as the number of groups increased, we visualized the eight groups of feature maps extracted by MGCNN-A8, as presented in Figure 15d: half of the feature maps are analogous, and the extracted features are very similar to those of MGCNN-A4.

Limitations
Although the proposed method performed well in feature extraction, it still has limitations in some aspects, such as the recognition of small and similar objects. As examples, Figure 16 presents the classification processes for some similar scenes. As can be seen from Figure 16a,b, bare land and desert display quite similar visual characteristics. During feature extraction for these two scenes, the first two convolution layers extracted low-level texture and color features, while the last three convolution layers synthesized the low-level features. Lastly, the fully connected layer identified the scene of the image from those features, completing the labeled classification. The high similarity of the extracted features, as shown in Figure 16a,b, caused confusion between these two scenes in the fully connected layers. For example, in Figure 16a, bare land is discriminated as "desert" by the model with a probability of 58.9%; conversely, in Figure 16b, the desert is recognized as bare land. As can be seen in Figure 16c, the extracted background features in the airplane scene are very similar to those of the airport runway, and the airplanes in Figure 16c are small and were therefore mistakenly recognized as cars by the model; Figure 16c was thus mistakenly categorized as airport runway and parking lot. The features extracted in Figure 16d were rather complicated: the first convolution layer accurately extracted the high-frequency signals; however, the model identified these high-frequency signals as airplanes or containers, and the background was mistakenly identified as highways.
From the above analysis, we can draw the following conclusions: (1) if two scenes both lack high-frequency signals and their backgrounds are similar (as in Figure 16a,b), the trained models can easily confuse their classes; (2) although MGCNN-A is capable of extracting the small objects in a scene, it is still difficult to label their categories correctly (as in Figure 16c,d).

Conclusions
In the present study, two grouped convolutional neural networks for remote sensing image scene classification, namely MGCNN and MGCNN-A, developed on the basis of ResNet-50, were proposed and tested with the RSI-CB and UC-Merced datasets. First, a data augmentation scheme was experimentally applied to three popular convolutional neural networks, i.e., VGGNet-16, GoogLeNet-22, and ResNet-50, to investigate their performance in remote sensing image scene classification; the results strongly suggested the effectiveness of data augmentation in improving classification performance with these networks, and ResNet-50 performed the best according to several criteria. To evaluate the proposed networks developed from the ResNet-50 backbone, several rigorously designed experiments were conducted with the proposed models on the RSI-CB and UC-Merced datasets. The experimental results indicated that grouping enables the proposed models to learn more abundant features, thereby helping the models distinguish different remote sensing image scenes more effectively. Although MGCNN-A is not much more accurate than MGCNN, the discussion shows that MGCNN-A is more robust in some categories. Although our proposed MGCNN and MGCNN-A models outperformed comparable ones, some limitations remained in classifying scenes with similar backgrounds but without high-frequency signals. Future attempts will focus on adjusting our proposed models with feature fusion and transferring them to segmentation tasks.
Author Contributions: X.W. and Z.Z. designed this study. X.W. performed the data collection, model derivation, and validation with help from Z.Z. and Y.Y. The corresponding author W.Z. is supervisor of this work and contributed with continuous guidance during this work. X.W. and Z.Z. jointly wrote this manuscript, and the manuscript was edited by W.Z., Q.X., and C.Z. All authors have read and agreed to the published version of the manuscript.