An Efficient and Lightweight Convolutional Neural Network for Remote Sensing Image Scene Classification.

Classifying remote sensing images is vital for interpreting image content. Presently, remote sensing image scene classification methods using convolutional neural networks have drawbacks, including excessive parameters and heavy calculation costs. More efficient and lightweight CNNs have fewer parameters and calculations, but their classification performance is generally weaker. We propose a more efficient and lightweight convolutional neural network method to improve classification accuracy with a small training dataset. Inspired by fine-grained visual recognition, this study introduces a bilinear convolutional neural network model for scene classification. First, the lightweight convolutional neural network, MobileNetv2, is used to extract deep and abstract image features. Each feature is then transformed into two features with two different convolutional layers. The transformed features are combined by a Hadamard product operation to obtain an enhanced bilinear feature. Finally, the bilinear feature, after pooling and normalization, is used for classification. Experiments are performed on three widely used datasets: UC Merced, AID, and NWPU-RESISC45. Compared with other state-of-the-art methods, the proposed method has fewer parameters and calculations while achieving higher accuracy. By including feature fusion with bilinear pooling, remote sensing scene classification performance and accuracy can be greatly improved. This approach could be applied to any remote sensing image classification task.


Introduction
In recent years, with the development of Earth observation technology, remote sensing image resolution has continuously improved, datasets have become larger, and applications have continued to expand. Therefore, rapid and efficient interpretation of these images has important applications [1][2][3][4][5].
Classification and scene recognition are important methods for remote sensing image interpretation. Scene classification refers to dividing the image into blocks and labeling each with an appropriate category (such as residential areas, farmland, rivers, and forests) according to the makeup of the blocks. This is helpful for image management, retrieval, analysis, detection, and recognition of typical targets. When resolution increases, images become more diverse, allowing for fine-grained classification and identification. At the same time, the details of high-resolution remote sensing images are richer, the features in the image are more diverse, and objects on the ground are usually staggered. The similarity between images of the same type decreases, while the similarity between images of different types increases significantly [6,7]. In addition, it is necessary to consider rotation and the positional relationships among targets in the image. These problems bring challenges to high-precision scene classification.
Sensors 2020, 20, 1999

Due to the addition of handcrafted features, the classification pipeline is divided into two stages, which cannot be trained or implemented in an end-to-end manner. So far, there are still few methods available for remote sensing image scene classification and object detection (such as the methods proposed by Zhang et al. [60], Zhang et al. [61], Teimouri et al. [62], etc.).
In view of the above problems, this study introduces the idea of feature fusion in the bilinear model [63,64] into the CNN MobileNetv2 [65] and designs an efficient and lightweight CNN called BiMobileNet for remote sensing image classification. The designed architecture not only achieves higher classification accuracy but also requires fewer calculations and parameters. The main contributions of this article are as follows: (1) The idea of the bilinear model in fine-grained visual recognition is introduced into remote sensing image classification, which enhances the ability of the CNN to distinguish different scene types. Compared with state-of-the-art methods for remote sensing image scene classification, the proposed method obtains superior performance. (2) By integrating the lightweight CNN MobileNetv2 and the feature fusion method of the bilinear model, the method in this study combines the advantages of a lightweight structure and high accuracy. Compared with other state-of-the-art methods, the proposed architecture has fewer parameters and calculations; image classification is therefore faster, rendering it more viable for production purposes and applications. (3) This study proposes that both the accuracy and the complexity of a method should be considered during classification, evaluating methods comprehensively in three aspects: accuracy, parameters, and calculations. In addition, we find that most methods use the UC Merced dataset with a training ratio of 80%, at which classification accuracy is close to saturation; we provide an accuracy benchmark for training ratios below 30%.
The remainder of this study is organized as follows. In Section 2, we illustrate the datasets used and the proposed architecture in detail. In Section 3, results and analysis of experiments on several datasets are detailed. Section 4 discusses results and Section 5 concludes the study with a summary of our method.
Compared with the UC Merced dataset, the AID dataset extends the number of scene categories to 30, with categories more finely classified. Each category contains ~220 to 440 RGB images; the total number of images in the dataset is 10,000. Image size is 600 × 600 pixels and the resolution is ~0.5-8 m. Figure 2 shows representative images of each class. More detailed information on this dataset can be found at http://www.lmars.whu.edu.cn/xia/AID-project.html.
NWPU-RESISC45 is a large-scale remote sensing image dataset, which further expands the number of categories and images. It contains 45 categories, each consisting of 700 RGB images with a size of 256 × 256 pixels. Image resolution ranges from 0.2-30 m. In addition, images cover more than 100 countries and regions, including different weather, seasons, spatial resolutions, and occlusion factors. Compared with other datasets, NWPU-RESISC45 images are more complex and diverse. Figure 3 shows representative images of each class. More detailed information can be found at http://www.escience.cn/people/JunweiHan/NWPU-RESISC45.html.

Table 1 summarizes the three datasets. Given the differences between each dataset, experiments on each will help verify the robustness and generalization of our proposed method.


Method
The method in this study integrates a lightweight CNN, MobileNetv2, with the bilinear model from fine-grained visual recognition. MobileNet [66] is a lightweight CNN proposed to apply deep learning on mobile and edge devices. It greatly reduces CNN parameters and calculations by using depthwise separable convolution. Although the classification accuracy of MobileNet on ImageNet is slightly lower than that of deep CNNs such as ResNet50, it has the unique advantages of a smaller size, fewer parameters, and fewer calculations, and can be used on mobile and embedded devices. MobileNetv2 introduced the inverted residual and linear bottleneck, further compressing parameters and calculations while improving performance. The bilinear model is a widely used method for fine-grained visual recognition. In the bilinear model, two parallel CNNs (with the last fully connected layers and classification layers removed) are used as feature extractors to obtain two deep features of the same image. The two features are then fused by bilinear pooling instead of concatenation, summation, or max pooling. Bilinear pooling is an efficient feature fusion strategy. Owing to its concise form and gradient calculation method, the bilinear model can be trained end-to-end and has excellent classification performance in fine-grained visual recognition. In the following sections, we introduce depthwise separable convolution, the linear bottleneck, the inverted residual, the bilinear model, and the network architecture.

Depthwise Separable Convolution
The core idea of depthwise separable convolution (Figure 4) is to divide the traditional standard convolution operation into two steps: depthwise convolution and pointwise convolution. Assume the input feature maps have size D_F × D_F × M and that N convolution kernels of size D_K × D_K × M are used, where D_F is the width and height of the input feature maps, M is the number of input channels, D_K is the width and height of the convolution kernels, N is the number of convolution kernels, and D_R is the width and height of the output feature maps. Standard convolution directly produces N feature maps of size D_R × D_R. When depthwise separable convolution is applied to feature maps of size D_F × D_F × M, M convolution kernels of size D_K × D_K × 1 are first convolved with each channel of the feature maps separately, producing output feature maps of size D_R × D_R × M. Depthwise convolution only changes the width and height of the feature maps, not the number of channels. To change the number of channels, pointwise convolution is applied after depthwise convolution: N kernels of size 1 × 1 × M are convolved with the feature maps to obtain N feature maps of size D_R × D_R. The feature maps generated by standard convolution and by depthwise separable convolution have the same size, but the numbers of parameters and calculations differ. The standard convolution calculation is

D_K × D_K × M × N × D_R × D_R,

while the depthwise separable convolution calculation is

D_K × D_K × M × D_R × D_R + M × N × D_R × D_R,

giving a ratio of 1/N + 1/D_K². In a CNN, when the size of the convolution kernels is 3 × 3, the depthwise separable convolution calculation is therefore ~1/9 of the standard convolution calculation. In addition, to further reduce the network parameters, MobileNet introduces two parameters: a channel multiplier and a resolution multiplier.
The channel multiplier, α, is used to proportionally expand or reduce the number of feature channels; the resolution multiplier, ρ, is used to proportionally enlarge or reduce the size of the feature maps. The calculation of the depthwise separable convolution after applying the two multipliers is

D_K × D_K × αM × ρD_R × ρD_R + αM × αN × ρD_R × ρD_R.
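The cost comparison above can be sketched in a few lines of Python (the layer sizes below are illustrative, not taken from the paper):

```python
# Multiply-accumulate counts for standard vs. depthwise separable convolution,
# following the formulas above. dk = kernel size, m = input channels,
# n = output channels, dr = output width/height.

def standard_conv_macs(dk: int, m: int, n: int, dr: int) -> int:
    # One dk x dk x m kernel per output channel, applied at every position.
    return dk * dk * m * n * dr * dr

def separable_conv_macs(dk: int, m: int, n: int, dr: int) -> int:
    depthwise = dk * dk * m * dr * dr   # one dk x dk filter per input channel
    pointwise = m * n * dr * dr         # n filters of size 1 x 1 x m
    return depthwise + pointwise

dk, m, n, dr = 3, 32, 64, 112
ratio = separable_conv_macs(dk, m, n, dr) / standard_conv_macs(dk, m, n, dr)
# ratio equals 1/n + 1/dk**2, close to 1/9 for 3 x 3 kernels and large n
```

For 3 × 3 kernels and 64 output channels the ratio is 1/64 + 1/9 ≈ 0.127, matching the ~1/9 figure quoted above.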

Linear Bottleneck
The linear bottleneck is introduced to address the information loss caused by activation functions such as ReLU (rectified linear unit) in CNNs. The activation function in a CNN generally performs a non-linear transformation on the input feature maps. This non-linearity allows the neural network to approximate arbitrary non-linear functions and enhances the network's ability to express information. The ReLU activation function outputs zero if the input is negative; for positive inputs, the output is a linear transformation of the input. The ReLU function therefore increases the sparsity of the network (outputs of zero are ignored) and reduces the interdependence between parameters, thus reducing the possibility of model overfitting. However, the ReLU function also causes large information losses for features with few channels, as feature dimension reduction is also feature compression. The essence of the linear bottleneck, therefore, is that after the pointwise convolutional layer and the batch normalization layer, the feature maps are passed directly to the next convolutional layer without applying a non-linear activation function (Figure 5).
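A toy illustration of the information loss that motivates the linear bottleneck (values are arbitrary):

```python
# ReLU zeroes all negative activations; in a low-dimensional feature this
# discards information that later layers cannot recover.

def relu(v):
    return [max(0.0, x) for x in v]

features = [-2.0, -0.5, 0.0, 0.5, 2.0]
activated = relu(features)          # [0.0, 0.0, 0.0, 0.5, 2.0]
# Two distinct negative inputs map to the same output value, so a linear
# (identity) bottleneck preserves strictly more information here than ReLU.
```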


Inverted Residual Block
MobileNetv2 uses the shortcut connection idea of the ResNet structure to fuse feature maps between different convolutional layers (Figure 6). When ResNet performs a shortcut connection, it first uses pointwise convolution to compress the channel number of the input feature maps (usually to 0.25 times the original number). The compressed feature maps are passed to a standard convolution module in which the channel number remains constant. The number of channels is then restored to the original number using another pointwise convolution, and finally the feature maps are added to the input feature maps. The inverted residual adopted in MobileNetv2 is the opposite: a complete inverted residual structure first performs a pointwise convolution to expand the number of feature channels to m times the original number (m is an integer greater than 1; in MobileNetv2 the value of m is 6), and then performs depthwise and pointwise convolution. The second pointwise convolution reduces the feature map channels back to the original number, and the obtained feature maps are added to the original feature maps. As with the linear bottleneck, the ReLU activation function is not used after the second pointwise convolution. The inverted residual structure not only has good memory efficiency but also improves network performance.
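As a rough sketch, the weight-only parameter count of one such expand/depthwise/project block can be computed as follows (batch normalization and bias parameters are ignored; the function is illustrative, not the paper's code):

```python
# Weight-only parameter count of one inverted residual block.
# channels = input/output channels, m = expansion factor (6 in MobileNetv2),
# k = depthwise kernel size.

def inverted_residual_params(channels: int, m: int = 6, k: int = 3) -> int:
    expanded = m * channels
    expand_pw = channels * expanded    # 1 x 1 conv: channels -> m * channels
    depthwise = k * k * expanded       # one k x k filter per expanded channel
    project_pw = expanded * channels   # 1 x 1 conv: m * channels -> channels
    return expand_pw + depthwise + project_pw
```

For 32 input/output channels and m = 6, the depthwise stage contributes only a small fraction of the weights; the two pointwise convolutions dominate.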



Bilinear Model
Lin et al. [63] first proposed a bilinear CNN model (B-CNN) for fine-grained visual recognition tasks, producing excellent performance. The core idea is to use two parallel CNNs to extract features from the same image and then merge the two features using bilinear pooling to obtain a new feature vector (Figure 7). A standard bilinear model B consists of four components: B = (f_A, f_B, P, C), where f_A and f_B are two CNN-based feature extraction functions applied to the same image, P is a pooling function, and C is a classification function. When the two CNNs extract features from the same image I, bilinear pooling at position l is computed with the outer product:

bilinear(l, I, f_A, f_B) = f_A(l, I)^T f_B(l, I).

When the output feature maps of the input image have size D_W × D_H × C, feature maps of size D_W × D_H × C² are obtained through the outer product operation. Each feature map is then sum-pooled globally over all positions to obtain a bilinear feature x of size C²; a square root operation (Equation (4)) and a normalization operation (Equation (5)) are then carried out on the bilinear feature to obtain the bilinear vector:

y = sign(x) · sqrt(|x|),   (4)
z = y / ||y||₂.   (5)

Finally, logistic regression or a support vector machine (SVM) is used for classification with the bilinear vector. The bilinear model is not only simple in form and procedure but also enables end-to-end training and testing, and it performs excellently in fine-grained classification tasks.
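A minimal NumPy sketch of outer-product bilinear pooling with the signed square root and L2 normalization (toy feature sizes and random values, not the paper's actual dimensions):

```python
import numpy as np

# Outer-product bilinear pooling for one image, as in B-CNN.
# fa, fb hold the two feature maps flattened to (positions, channels).
rng = np.random.default_rng(0)
positions, c = 49, 8                     # e.g. 7 x 7 spatial positions, C = 8
fa = rng.standard_normal((positions, c))
fb = rng.standard_normal((positions, c))

# Sum of per-position outer products, flattened to a C^2 bilinear feature.
x = (fa.T @ fb).ravel()
y = np.sign(x) * np.sqrt(np.abs(x))      # signed square root (Equation (4))
z = y / np.linalg.norm(y)                # L2 normalization (Equation (5))
```

The matrix product `fa.T @ fb` is exactly the sum over positions l of the outer products f_A(l)ᵀ f_B(l), so no explicit loop is needed.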

Proposed Architecture
BiMobileNet integrates the structure of MobileNetv2 with the feature fusion method of the bilinear model; its network structure is shown in Figure 8. The network includes three main parts: feature extraction, bilinear pooling of features to obtain a bilinear vector, and classification of bilinear features. The backbone of the feature extraction layer utilizes MobileNetv2, but not all of its layers. We removed the last three layers: one convolutional layer, one average pooling layer, and one classification layer. This is because the feature maps of the last convolutional layer have size 7 × 7 × 1280, which still entails too many parameters; we wanted as few as possible. The structure of BiMobileNet is shown in Table 2. The feature extraction layer contains a convolutional layer and seven bottlenecks, each consisting of a linear or inverted residual block. For example, when an image of size 224 × 224 × 3 passes through the feature extraction layer, feature maps of size 7 × 7 × 320 are obtained. In BiMobileNet we did not use two convolutional neural networks to extract features; instead, the feature maps extracted from the same network were shared, reducing the parameters and calculations in the model.

Before bilinear fusion of the obtained features, inspired by the hierarchical bilinear model proposed in [64], we apply two feature transformation layers to the extracted features, transforming the same feature into two different, but similar, features. Each feature transformation layer is essentially a convolutional layer with 1024 convolution kernels of size k × k (here k is 3). After the feature maps of size 7 × 7 × 320 pass through the feature transformation layers, two sets of feature maps of size 7 × 7 × 1024 are generated. The original bilinear model used the outer product when performing bilinear pooling on two different features. This expands the dimensions of the obtained feature maps by C times (C is the channel number of the feature maps), so the dimensions of the bilinear feature vector are also expanded C times; when the channel number of the two features is 1024, a million-dimensional feature vector would be generated. Not only is such a model prone to overfitting, but the training and implementation time is also significant. Using the Hadamard product instead of the outer product in the bilinear model keeps the dimensions of the original feature maps unchanged. The Hadamard product multiplies the elements at corresponding positions in two matrices of the same order and does not change the dimension of the matrices. In the bilinear fusion of BiMobileNet, two feature maps of size 7 × 7 × 1024 are combined by the Hadamard product to generate bilinear features of size 7 × 7 × 1024, keeping the feature dimensions unchanged.
The average pooling operation is performed on the bilinear features of size 7 × 7 × 1024, yielding a bilinear feature vector of size 1 × 1 × 1024. The classification layer is a fully connected layer: its input is the bilinear feature vector of size 1 × 1 × 1024, and its output is the probability of each category.
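A minimal NumPy sketch of this Hadamard-product fusion and average pooling (random arrays stand in for the two transformed feature maps):

```python
import numpy as np

# Hadamard-product bilinear fusion as described above: two transformed
# feature maps of the same shape are multiplied elementwise, then average
# pooled over the spatial dimensions. Shapes follow the text (7 x 7 x 1024).
rng = np.random.default_rng(0)
fa = rng.standard_normal((7, 7, 1024))
fb = rng.standard_normal((7, 7, 1024))

fused = fa * fb                          # elementwise product, shape unchanged
vector = fused.mean(axis=(0, 1))         # average pool -> 1024-d bilinear vector
```

Unlike the outer product, which would yield a 1024²-dimensional vector here, the Hadamard product keeps the feature dimension at 1024.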

Implementation Details
Before the experiments, each dataset was divided into training and test sets. To enable comparisons with other methods, we adopted different training ratios for the experiments. The training and test sets were chosen randomly from the original dataset. Every experiment was performed five times, and the mean and standard deviation of the five results were calculated. By rotating the training images 90°, 180°, and 270° clockwise and applying horizontal and vertical flips, we expanded the training data six-fold. This augmentation helped the CNN models generalize.
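The six-fold expansion described above can be sketched with NumPy (the function name `augment` is ours, not the paper's):

```python
import numpy as np

# Six-fold augmentation: the original image plus 90/180/270 degree clockwise
# rotations and horizontal/vertical flips.
def augment(image: np.ndarray) -> list:
    return [
        image,
        np.rot90(image, k=3),   # 90 degrees clockwise
        np.rot90(image, k=2),   # 180 degrees
        np.rot90(image, k=1),   # 270 degrees clockwise
        np.fliplr(image),       # horizontal flip
        np.flipud(image),       # vertical flip
    ]

views = augment(np.arange(12).reshape(3, 4))
```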
We utilized the open-source deep learning framework PyTorch to build BiMobileNet. BiMobileNet can be trained end-to-end, using stochastic gradient descent to update its parameters. As BiMobileNet shares part of the MobileNetv2 network structure, we used MobileNetv2's pre-trained ImageNet weights to initialize parameters before training. The hyperparameter settings were as follows: the initial learning rate of the feature extraction layers was 0.01, while the initial learning rate of the bilinear pooling layer and the classification layer was 0.1. Every 10 epochs, the learning rate was multiplied by 0.5. The momentum and weight decay were 0.9 and 0.0005, respectively. The training batch size was 32, and the number of training epochs was 100. Once the training loss had stabilized, we used the trained model to predict the test set. All experiments were performed on a device with an Intel Core i7-6900K CPU with 64 GB RAM and a GeForce GTX1080Ti GPU with 11 GB RAM.

Evaluation Protocol
BiMobileNet was comprehensively evaluated in terms of classification accuracy and model complexity. Accuracy is expressed by two criteria: the confusion matrix and overall accuracy. In the confusion matrix, each row represents the true class and each column the predicted class. The diagonal elements of the confusion matrix represent the classification accuracy of each class, and the off-diagonal element CM_ij represents the probability that images from class i are mistakenly recognized as class j. The overall accuracy is defined as the number of correctly predicted images divided by the total number of predicted images.
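Both criteria are straightforward to compute; a minimal Python sketch follows (function names are ours, and the labels are toy values):

```python
# Row-normalized confusion matrix (row = true class, column = predicted class)
# and overall accuracy, as defined above.
def confusion_matrix(y_true, y_pred, n_classes):
    cm = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1.0
    for row in cm:                       # normalize each row to probabilities
        total = sum(row)
        if total:
            for j in range(len(row)):
                row[j] /= total
    return cm

def overall_accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], 2)
oa = overall_accuracy([0, 0, 1, 1], [0, 1, 1, 1])
```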
Model complexity includes time complexity and space complexity. Time complexity indicates the number of model operations, determines the training and prediction time of the model, and is represented by floating point operations (FLOPs). The higher the time complexity, the slower the model: for a model with high time complexity, training and prediction times are long, which is less favorable for practical training and application. Space complexity indicates the size of the model and can be represented by the model size and the total number of model parameters. A model with many parameters needs a large amount of data to train and can easily overfit.

Classification of the UC Merced Dataset
The training ratios were set at 20%, 50%, and 80%. Table 3 compares the classification performance of our architecture with state-of-the-art methods on the UC Merced dataset. Analyzing the overall accuracies obtained under different training ratios, we find that when 80% of the UC Merced data is used for training, the overall accuracies of other state-of-the-art methods are very close to 99.00%; at this training ratio, classification accuracy becomes saturated and is difficult to improve further. Using 80% of the data for training, BiMobileNet nearly achieves the highest overall accuracy (99.03% compared to 99.05%); using 50% of the data, BiMobileNet achieves the highest overall accuracy of 98.45%, which is better than many methods using a training ratio of 80%. When training with 20% of the data, the classification accuracy of BiMobileNet reaches 96.41%. On the one hand, BiMobileNet achieves outstanding performance at a training ratio of 50% (similar to its result at 80%); on the other hand, this confirms that classification accuracy saturates at a training ratio of 80% and is difficult to improve further. With small training samples, model complexity and related issues must also be considered. Most of the state-of-the-art methods in Table 3 adopt deep CNNs (such as VGG16 and ResNet50). Deep CNNs with many layers and parameters generally require heavy computation (parameters and calculations are discussed in detail in Section 3.3), and training these networks requires a large amount of data. It is therefore difficult to train them with little training data, and overfitting often occurs. In addition, for a classification task it is unreasonable and unrealistic to use 80% of the data for training: annotation is time-consuming, and manually labeling 80% of the data is unfeasible.
For practical applications, it is necessary to reduce manual annotation as much as possible and to improve classification accuracy and efficiency with as little data as possible. BiMobileNet can still achieve an accuracy of 96.41% when only 20% of the data is used for training. To further verify BiMobileNet's performance with little training data, additional training ratios (5%, 10%, 15%, 20%, and 25%) were used (Table 4). When the training ratio was 5% (i.e., five images randomly selected from each category for training, with the remaining 95 predicted), BiMobileNet achieved a remarkable overall accuracy of 86.74%, far higher than directly fine-tuning VGG16, ResNet50, or MobileNetv2. This means that, faced with a new, larger dataset, we can label only a very small portion of the data for training and then predict the rest, saving considerable labor and time. When the training ratio was 20%, BiMobileNet achieved an overall accuracy of 96.41%, while the method proposed by Chaib [57] reached 92.96%. At the other low training ratios, BiMobileNet also achieved excellent performance. This demonstrates that the method in this study is not prone to overfitting with little training data and holds a large accuracy advantage over deep CNNs in the scene classification task on the UC Merced dataset. Figures 9-12 show the confusion matrices for training ratios of 5%, 10%, 20%, and 50%, respectively, on the UC Merced dataset. When the training ratio is 5%, 11 of the 21 scene categories achieve a classification accuracy greater than 92%, and only four categories, buildings (0.73), dense residential (0.33), intersection (0.76), and river (0.57), fall below 80%. Many dense residential images (the category with the lowest accuracy) are misidentified as medium residential or mobile home park, as the three types are very similar; with only five training images, it is difficult for a CNN to distinguish these categories effectively.
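The per-class sampling protocol described above (e.g., a 5% ratio on UC Merced takes five of the 100 images in each category for training and predicts the remaining 95) amounts to a stratified split. A minimal sketch follows; the function and file names are illustrative, not from the paper's released code.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_ratio, seed=0):
    """Split (path, label) pairs class by class, so every category
    contributes the same fraction of images to the training set."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    rng = random.Random(seed)
    train, test = [], []
    for label, paths in by_class.items():
        rng.shuffle(paths)
        n_train = max(1, round(len(paths) * train_ratio))
        train += [(p, label) for p in paths[:n_train]]
        test += [(p, label) for p in paths[n_train:]]
    return train, test

# Example: 21 classes x 100 images (UC Merced-sized), 5% training ratio
data = [(f"img_{c}_{i}.tif", c) for c in range(21) for i in range(100)]
tr, te = stratified_split(data, 0.05)
print(len(tr), len(te))  # 105 1995
```

Splitting per class rather than globally guarantees every category is represented even at the smallest ratios.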
When the training ratio is 10%, 16 of the 21 scene categories achieve a classification accuracy greater than 92%; dense residential again has the lowest accuracy (67%), but twice that achieved at the 5% training ratio. When the training ratio is 20%, 16 of the 21 scene categories achieve a classification accuracy greater than 95%. When the training ratio is 50%, 18 of the 21 scene categories achieve a classification accuracy greater than 98%, and the three lower-accuracy categories (buildings, dense residential, and medium residential) still reach 92%. These three categories perform worse because buildings are their main image component: the images were originally annotated by building density, a subjective and perceptual judgment with no quantitative standard, which explains the lower model accuracy.
Figure 9. Confusion matrix using a training ratio of 5% on the UC Merced dataset.
Figure 10. Confusion matrix using a training ratio of 10% on the UC Merced dataset.
Figure 11. Confusion matrix using a training ratio of 20% on the UC Merced dataset.
Figure 12. Confusion matrix using a training ratio of 50% on the UC Merced dataset.
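The per-class and overall accuracies read off the confusion matrices above follow directly from the matrix itself. A minimal sketch, using a made-up 3-class matrix rather than the paper's data:

```python
import numpy as np

def per_class_accuracy(cm):
    """Row i of the confusion matrix holds the true class, so
    cm[i, i] / cm[i].sum() is the accuracy of class i."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

def overall_accuracy(cm):
    """Fraction of all samples lying on the diagonal."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

# Toy 3-class example (rows: true class, columns: predicted class)
cm = [[90, 7, 3],
      [10, 80, 10],
      [0, 5, 95]]
print(per_class_accuracy(cm))  # [0.9  0.8  0.95]
print(overall_accuracy(cm))    # ~0.883
```

Note that the overall accuracy weights classes by their sample counts, whereas the per-class figures quoted in the text treat every category equally.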

Classification of the AID Dataset
When using the AID dataset, training ratios were set at 10%, 20%, and 50%. Table 5 compares the accuracy of state-of-the-art methods with our approach. Using a training ratio of 50%, the overall classification accuracy of BiMobileNet is 96.87%, higher than most other methods. When the training ratio is 20%, the overall accuracy is 94.83%, ~1% higher than all other methods, and when the training ratio is 10%, the overall accuracy is 92.77%. BiMobileNet produces accuracy similar to D-CNN [47], GCFs+LOFs [9], and SF-CNN [44] at a 50% training ratio but performs ~4.0%, ~2.5%, and ~1.2% better, respectively, at a 20% training ratio. The D-CNN, GCFs+LOFs, and SF-CNN networks all adopt VGG16, whose parameters and calculations are much larger than those of the MobileNetv2 backbone used in BiMobileNet.
Comparing Tables 3 and 5, the classification accuracy of the different methods on the AID dataset is generally lower than on the UC Merced dataset for a given training ratio. This is because the AID dataset has more categories and more diverse data, rendering classification more challenging. Figures 13-15 show the confusion matrices when the training ratios of the AID dataset are set to 10%, 20%, and 50%, respectively. When the training ratios are 10% and 20%, 22 and 27 categories (out of 30), respectively, have a classification accuracy greater than 91%. When the training ratio is 50%, the accuracy of most categories is greater than 98%. As training data increases, the classification accuracy of most categories improves significantly. However, the accuracy of the resort class, 76% (10% training ratio), 72% (20%), and 83% (50%), is lower than that of all other classes. Some images from the resort category are mistaken for park, mainly because park and resort have a similar distribution of objects (buildings and vegetation). In addition, school and commercial, and center and square, have similar features. Consequently, the school and resort classes have relatively low classification accuracies compared with other categories when the training ratio of the AID dataset is set to 50%.

Table 5. Overall accuracy of the state-of-the-art methods on the AID dataset. The highest accuracy for each ratio is bolded.

Figure 15. Confusion matrix using a training ratio of 50% on the AID dataset.

Classification of the NWPU-RESISC45 Dataset
For the NWPU-RESISC45 dataset, training ratios were set to 10% and 20%. The classification accuracies of state-of-the-art methods and BiMobileNet are shown in Table 6. The NWPU-RESISC45 dataset has more categories and images than the other two datasets. The overall accuracy of BiMobileNet is 92.06% and 94.08% at training ratios of 10% and 20%, respectively, higher than all but one of the other methods. When the training ratio is 10%, BiMobileNet's accuracy is 2.1%, 1.0%, and 0.3% higher than SF-CNN [44], GLANet [46], and DML [49], respectively, and is similar to DDRL-AM [41]. SF-CNN, GLANet, and DML adopt the deep CNN VGGNet; DDRL-AM adopts the deep CNN ResNet18. The parameters and calculations of these networks are significantly larger than those of the MobileNetv2 used in BiMobileNet. When the training ratio is 20%, BiMobileNet's accuracy is ~1.5%, ~0.6%, ~0.6%, and ~1.6% better than SF-CNN, GLANet, DML, and DDRL-AM, respectively. Figures 16 and 17 show the confusion matrices obtained by BiMobileNet using training ratios of 10% and 20%, respectively, on the NWPU-RESISC45 dataset. When the training ratio is 10%, the classification accuracy of 35 categories is greater than 90%; when the training ratio is 20%, 41 categories exceed 90% (for GLANet, 38 categories exceed 90%). As the training ratio increases, the classification accuracy of most categories improves significantly. Although the accuracies of the church (72% at a 10% ratio, 75% at 20%) and palace (68%, 78%) categories improve, they remain significantly lower than those of other categories, because the two categories share similar architectural styles and layouts and are easily confused.

Table 6. Overall accuracy of the state-of-the-art methods on the NWPU-RESISC45 dataset. The highest accuracy for each ratio is bolded.

Figure 16. Confusion matrix using a training ratio of 10% on the NWPU-RESISC45 dataset.
Figure 17. Confusion matrix using a training ratio of 20% on the NWPU-RESISC45 dataset.


Discussion
Currently, most remote sensing image scene classification methods consider only classification accuracy, and seldom account for parameters, calculations, and other practical issues. From our experimental results on three datasets (Tables 3, 5 and 6), the majority of methods use deep CNNs such as VGG16 [56], GoogLeNet [58], and ResNet [59]. Although deep CNNs show strong generalization for image classification and object detection, they have clear disadvantages, such as very large parameter counts and significant computation time for training and prediction. For example, the model size of VGG16 exceeds 512 MB, and its number of parameters exceeds 134 million, which directly affects training and prediction time. More importantly, deep CNNs are prone to overfitting with little sample data. Moreover, such deep CNNs can only be trained and deployed on hardware with high computational performance, which is not always conducive to practical application. Although CNNs such as GoogLeNet and ResNet have significantly fewer parameters and lower computational cost than VGG16, their requirements are still substantial and unsuitable for mobile or other edge devices. A model with fewer parameters and less computation is therefore more suitable for practical application, especially for real-time classification and object detection. In other words, an efficient and lightweight CNN does not require high-end equipment and can still achieve strong performance. For remote sensing image classification, the model may need to be deployed on UAVs (unmanned aerial vehicles), small satellites, and other devices in the future to achieve real-time classification. The model we designed therefore needs not only outstanding performance but also faster speed and less computation; that is the advantage of our approach using MobileNet.
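MobileNet-family networks owe much of their parameter savings to depthwise-separable convolutions. A back-of-the-envelope count for one hypothetical 3x3 layer (illustrative channel sizes, not BiMobileNet's exact configuration) shows the gap:

```python
def standard_conv_params(c_in, c_out, k):
    # A k x k kernel over every input/output channel pair (bias omitted).
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    # One k x k filter per input channel (depthwise),
    # followed by a 1x1 pointwise convolution mixing channels.
    return k * k * c_in + c_in * c_out

# A 3x3 layer mapping 256 -> 256 channels:
std = standard_conv_params(256, 256, 3)        # 589824
dws = depthwise_separable_params(256, 256, 3)  # 67840
print(std, dws, round(std / dws, 1))  # ~8.7x fewer parameters
```

The same factorization also cuts the multiply-accumulate count by roughly the same factor, which is why MobileNetv2 fits edge-device budgets that VGG16 cannot.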
A comparison of different state-of-the-art methods in terms of overall accuracy, parameters, calculation, and model size is shown in Table 7. Many algorithms build on VGG16 or ResNet, such as SF-CNN [44] and DML [49]. As Table 7 shows, VGG16 has the most parameters and calculations. SF-CNN replaced the last two fully connected layers of VGG16 with convolutional layers and adopted global mean pooling in the classification layer; this reduced the number of parameters but did not fundamentally reduce the calculations. DML [49] did not change the structure of VGG16 but adopted a mean center loss: although it improved the accuracy of the original VGG16 in remote sensing image scene classification, the number of parameters and calculations did not decrease. SAL-TS-Net [8] merged the features of two GoogLeNet networks in parallel; compared with VGG16, its parameters and computational cost are reduced, but its precision is lower than directly fine-tuning VGG16. The BiMobileNet we designed has the fewest parameters and the least computation yet achieves higher accuracies. Different channel reduction factors λ are set in BiMobileNet to further reduce model parameters and calculations. When λ is 0.75 and k is 1, the overall accuracy of BiMobileNet is higher than that of most methods, with approximately 1/11, 1/85, and 1/6 the parameters of SF-CNN, DML, and SAL-TS-Net, respectively, and approximately 1/65, 1/65, and 1/6 their calculations, respectively. Compared with other state-of-the-art methods, BiMobileNet clearly not only delivers outstanding performance but also significantly reduces parameters and computational cost. (1 GFLOP = 10^9 floating-point operations; k is the kernel size in the bilinear pooling layer.)
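The bilinear pooling step, as described earlier (two projections of the backbone feature map, a Hadamard product, sum pooling, then signed-square-root and L2 normalization), can be sketched as below. The feature-map and projection sizes are assumptions for illustration, and the 1x1 convolutions are modeled as plain matrix multiplies; this is not the paper's exact λ and k configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard_bilinear_pool(feat, w1, w2):
    """Factorized bilinear pooling sketch: project an (H, W, C) feature
    map with two learned 1x1 projections, take the element-wise
    (Hadamard) product, sum-pool over the spatial grid, then apply
    signed-sqrt and L2 normalization."""
    a = feat @ w1                        # (H, W, D)
    b = feat @ w2                        # (H, W, D)
    z = (a * b).sum(axis=(0, 1))         # Hadamard product + sum pooling -> (D,)
    z = np.sign(z) * np.sqrt(np.abs(z))  # signed square-root normalization
    return z / (np.linalg.norm(z) + 1e-12)  # L2 normalization

feat = rng.standard_normal((7, 7, 1280))   # e.g. MobileNetv2's final feature map
w1 = rng.standard_normal((1280, 512)) * 0.01
w2 = rng.standard_normal((1280, 512)) * 0.01
v = hadamard_bilinear_pool(feat, w1, w2)
print(v.shape)  # (512,)
```

Because the two projections are multiplied element-wise rather than via an outer product, the pooled descriptor stays D-dimensional instead of D², which is what keeps the bilinear head lightweight.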

Conclusions
This study introduces the bilinear model idea from fine-grained image classification into the remote sensing image scene classification task. Based on MobileNetv2, a highly efficient, lightweight convolutional neural network (CNN) for remote sensing image scene classification is proposed: BiMobileNet. MobileNetv2 has the advantages of fewer parameters and fewer calculations, but its remote sensing image classification performance is generally weaker than that of deep CNNs. The MobileNetv2 backbone is used to extract image features, which are then bilinearly pooled to increase intra-class consistency and inter-class distinction; this significantly improves scene classification accuracy and can be applied to any remote sensing classification task. By training and testing on three widely used large-scale remote sensing image datasets, both the accuracy and the complexity of the model were evaluated, with the following conclusions drawn:
1. The accuracy of BiMobileNet in remote sensing image scene classification surpasses most state-of-the-art methods, particularly with little training data.
2. BiMobileNet requires fewer parameters and calculations, making training and prediction faster and more efficient.
3. The challenges of remote sensing image scene classification are intra-class inconsistency and inter-class indistinction; bilinear pooling overcomes some of these difficulties, providing a simple and efficient method for scene classification.
In addition, compared with the ImageNet dataset (1000 categories), current remote sensing image datasets have far fewer categories (45 in NWPU-RESISC45, 30 in AID, and 21 in UC Merced). Real-world scene categories are more diverse and complex than this, which limits practical applications. However, the lightweight and efficient CNN described in this study will aid faster and more accurate classification of remote sensing images.