Deep Transfer Learning for Land Use and Land Cover Classification: A Comparative Study

Efficiently implementing remote sensing image classification with high-spatial-resolution imagery can provide significant value in land use and land cover (LULC) classification. New advances in remote sensing and deep learning technologies have facilitated the extraction of spatiotemporal information for LULC classification. Moreover, diverse scientific disciplines, including remote sensing, have benefited from tremendous improvements in image classification involving convolutional neural networks (CNNs) with transfer learning. In this study, instead of training CNNs from scratch, transfer learning was applied to fine-tune the pre-trained Visual Geometry Group network (VGG16) and Wide Residual Networks (WRNs), by replacing the final layers with additional layers, for LULC classification using the red–green–blue version of the EuroSAT dataset. Moreover, the performance and computational time were compared and optimised with techniques such as early stopping, gradient clipping, adaptive learning rates, and data augmentation. The proposed approaches addressed the limited-data problem, and very good accuracies were achieved. The results show that the proposed method based on WRNs outperformed the previous best results in terms of computational efficiency and accuracy, achieving 99.17%.


Introduction
There have been rapid advancements in remote sensing technologies, satellite image acquisition, the production of unprecedented sources of information, and increased data availability, allowing us to understand the features of the Earth more comprehensively and encouraging innovation and entrepreneurship. The enhanced ability to observe the Earth from low-orbit and geostationary satellites [1] and the better spatial resolution of remote sensing data [2] have led to the development of novel approaches for remote sensing image analysis, facilitating extensive ground surface studies. Scene classification, which aims at labelling an image according to a set of semantic categories [3], is eminent in the remote sensing field due to its extensive applications, including Land Use and Land Cover (LULC) classification [4,5] and land resource management [2].
Recent years have witnessed great advances in LULC classification tasks such as denoising, cloud shadow masking, segmentation, and classification [6][7][8][9]. Extensive algorithms have been devised with a concrete theoretical basis, exploiting the spectral and spatial properties of pixels. However, with the increasing level of abstraction from pixels to objects to scenes, and the complex spatial distributions of diverse land cover types, classification continues to be a challenging task [10]. Object- or pixel-based approaches [11][12][13], which rely on low-level features encoding spectral, textural, and geometric properties, are unable to capture the semantics of a scene. Hu et al. [14] deduced that more representative, high-level features, which are abstractions of the low-level features, are necessary for scene classification. Currently, Convolutional Neural Networks (CNNs) are the dominant methods in image classification, detection, and segmentation tasks because of their ability to extract high-level feature representations that describe scene images [15].
Hu et al. [14] observed that, in spite of CNNs' fine capability to extract high-level and low-level features, it is tedious to train CNNs with smaller datasets. Yin et al. [16] and Yosinski et al. [17] observed that the features learned by the layers from different datasets show common behaviour: convolution operators in the initial layers learn general characteristics, and towards the final layers there is a transition to features more specific to the dataset on which the model is trained. These general-to-specific feature transitions across CNN layers have led to the development of transfer learning [18,19]. As a result, the features learnt by a CNN model on a primary task can be employed for a different secondary task in transfer learning; the primary model acts as a starting point or a feature extractor for the secondary model. The contributions made in this article are listed below.

• LULC classification is performed using two transfer learning architectures, namely the Visual Geometry Group network (VGG16) and the Wide Residual Network (Wide ResNet-50), on the Red-Green-Blue (RGB) version of the EuroSAT dataset.
• The performance of the methods is empirically evaluated with and without data augmentation.
• The model performance and computational efficiency are improved with model enhancement techniques.
• The RGB version of the EuroSAT dataset is benchmarked.
The rest of the paper is organized as follows. First, the related works are presented in Section 2. In Section 3, the dataset used herein is described, and the methodologies of the modified VGG16 and Wide ResNet-50 are presented. The results and analyses are demonstrated in Section 4. A discussion is made in the light of other studies in Section 5, and finally, the paper is concluded in Section 6.

Related Works
This section mainly presents the recent studies in remote sensing scene classification using Deep Learning (DL) and Transfer Learning (TL). Furthermore, it presents the state-of-the-art image classification methods for LULC on the EuroSAT dataset.
Xu et al. [20] used Principal Component Analysis (PCA) to reduce data redundancy and then trained a self-organizing network to classify Landsat satellite images, which outperformed the maximum likelihood method. Later, Chen et al. [21] showed the potential of DL for hyperspectral data classification with a hybrid framework that includes DL, logistic regression, and PCA [22]. Stacked autoencoders were used in DL frameworks to extract high-level features. Basu et al. [5] and Zou et al. [15] used deep belief networks for remote sensing image classification and experimentally demonstrated the effectiveness of the model. Piramanayagam et al. [22] and Liu et al. [23] demonstrated the potential of CNNs for LULC classification, where they actively selected training samples at each iteration with DL for better performance. The scarcity of labelled data was tackled by implementing data augmentation techniques [24]. Furthermore, Yang et al. [25] improved the generalisation capability and performance against limited data by combining a deep CNN with multi-scale feature fusion. Liu et al. [26] also proposed a scene classification method based on a deep random-scale stretched CNN. Another constraint with remote sensing images was the presence of scenic variability, which limited the classification performance. As a work-around, the Saliency Dual Attention Residual Network (SDAResNet), containing both spatial and channel attention, was studied in [27], leading to better performance. Later, Xu et al. [28] proposed an enhanced classification method involving a Recurrent Neural Network along with a Random Forest for LULC. Another approach with an attention mechanism, based on the pre-trained EfficientNet-B3 CNN, was studied by Alhichri et al. [29]; they tested it on six popular LULC datasets and demonstrated its capability in remote sensing scene classification tasks. Liang et al. [30] and Pires de Lima and Marfurt [31] proposed specific fine-tuning strategies that outperformed CNNs trained from scratch for aerial image classification. Kwon et al. [32] proposed a robust classification score method for detecting adversarial examples in deep neural networks that does not invoke any additional process, such as changing the classifier or modifying the input data. Bahri et al. [33] experimented with a TL technique that outperformed all the existing baseline models by using the Neural Architecture Search Network Mobile (NASNet Mobile) as a feature descriptor, and also introduced a loss function that contributed to the performance.
In the context of LULC classification on the EuroSAT dataset (Table 1), Helber et al. [34], the dataset's creators, used the GoogleNet and ResNet-50 architectures with different band combinations. They found that ResNet-50 with the RGB bands achieved the best accuracy, compared to GoogleNet with the RGB bands and ResNet-50 with Short-Wave Infrared (SWIR) and Color-Infrared (CI) combinations. The Deep Discriminative Representation Learning with Attention Map (DDRL-AM) method, proposed by Li et al. [38], obtained the highest accuracy of 98.74% using the RGB bands among the results listed in Table 1. Finally, Yassine et al. [39] explored two approaches for improving the accuracy on the EuroSAT dataset. In the first approach, the 13 spectral bands of Sentinel-2 were used for feature extraction, producing 98.78% accuracy. In the second approach, the 13 spectral bands of Sentinel-2, along with calculated indices such as the Vegetation Index based on Red Edge (VIRE), Normalized Near-Infrared (NNIR), and Blue Ratio (BR), were used for feature extraction, resulting in an accuracy of 99.58%.

Materials and Methods
Herein, TL is used to carry out the LULC classification. In past experiments, several architectures have been proposed and tested for scene classification [22][23][24]. After experimenting with and comparing different pre-trained architectures [25][26][27][28], VGG16 and Wide ResNet-50 were selected for this particular use-case. The models are fine-tuned on the RGB version of the EuroSAT dataset and trained using the PyTorch framework in the Python language. NVIDIA Tesla P100 GPUs available with Kaggle are used for model training and testing.

Dataset
The EuroSAT dataset is a novel dataset based on the multispectral image data provided by the Sentinel-2 satellite. It has 13 spectral bands and consists of 27,000 labelled and georeferenced images (2,000-3,000 images per class) categorised into 10 different scene classes. The image patches contain 64×64 pixels with a spatial resolution of 10 m. Figure 1 shows some sample images from the EuroSAT dataset [34].

Transfer Learning Methods
VGG16, a very deep convolutional network, demonstrated that representation depth is beneficial for classification accuracy [41]. The pre-trained VGG model is trained on the ImageNet dataset with 1000 classes, with each convolutional block possessing multiple convolutional layers. The initial layers learn low-level features, while the deeper layers learn high-level features of the images.
ResNet can be viewed as an ensemble of many smaller networks and shows commendable performance on image recognition tasks [42][43][44]. The performance degradation problem [45] caused by adding more layers to sufficiently deep networks was tackled by ResNet by introducing identity shortcut connections [46]. Wide Residual Networks are an improvement over Residual Networks: they possess more channels, with increased width and decreased depth compared to Residual Networks [47].
In this research, the pre-trained VGG16 and Wide ResNet-50 models are used. Both pre-trained models expect input images normalised in mini-batches of 3-channel RGB images of shape (3×H×W), where H and W are expected to be 224. The final classification layers are replaced with fully connected and dropout layers (see Figure 2). ReLU and log-softmax activation functions are also used. The initial layers are frozen during training, and the modified layers are fine-tuned with the EuroSAT dataset. The model is trained for 25 epochs with a batch size of 64. Adam [48] is used as the model optimizer, with categorical cross-entropy for loss calculation. To enhance the model's efficiency in terms of computation time and performance, model enhancement techniques such as gradient clipping, early stopping, data augmentation, and adaptive learning rates are used. Data augmentation techniques [49] enhance the diversity of the data to some extent by introducing visual variability, which helps the model interpret the information more accurately. For the EuroSAT dataset, the data augmentation techniques used herein are Gaussian blurring, horizontal flip, vertical flip, rotation, and resizing. Many other data augmentation techniques are available, but due to the inherent uniformity of the EuroSAT dataset, most of them did not have a significant impact.

Gradient Clipping
Gradient clipping [50] can prevent the exploding gradient problem, which destabilises the parameters during training. A predefined gradient threshold is set, and gradient norms that surpass this threshold are scaled down to match it. The norm is calculated over all the gradients collectively, and the maximum norm used herein is 0.1.
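In PyTorch this corresponds to calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)` after the backward pass. The underlying computation can be sketched in plain Python (the helper name `clip_global_norm` is our own, for illustration):

```python
import math

def clip_global_norm(grads, max_norm=0.1):
    """Scale all gradients so that their collective L2 norm does not
    exceed max_norm, mirroring torch.nn.utils.clip_grad_norm_."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm

# A single gradient vector with L2 norm 5.0 is scaled down to norm 0.1.
clipped, norm = clip_global_norm([[3.0, 4.0]], max_norm=0.1)
```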

Early Stopping
Early stopping is a regularisation technique for deep neural networks that stops the training once the model performance stops improving on a held-out validation dataset. In essence, the best model weights are saved and updated throughout training. When parameter changes no longer provide an improvement after a certain number of epochs, training is terminated and the last best parameters are used (Figure 3). This process reduces overfitting and enhances the generalisation capability of deep neural networks.
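A minimal sketch of this logic follows; the class and attribute names are our own, and the study uses a patience of 5 on the validation accuracy:

```python
class EarlyStopping:
    """Stop training when the monitored metric stops improving."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = None
        self.bad_epochs = 0
        self.stop = False

    def step(self, val_acc):
        if self.best is None or val_acc > self.best:
            self.best = val_acc   # in practice, also save the model weights
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.stop = True
        return self.stop

# Training would break out of the epoch loop once stop becomes True.
stopper = EarlyStopping(patience=5)
history = [0.90, 0.95, 0.94, 0.94, 0.94, 0.94, 0.94]
flags = [stopper.step(acc) for acc in history]
```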

Learning rate optimisation
The learning rate is a hyperparameter that controls how much the model weights are updated in response to the estimated error in each iteration. Choosing the learning rate can be difficult, since a value that is too small can lead to a lengthy training procedure with significant training error, while a value that is too big can lead to learning a sub-optimal set of weights too quickly (without reaching the local minimum) or an unstable training process [51]. The ReduceLROnPlateau learning rate scheduler is used herein [52]. When learning stagnates, models frequently benefit from reducing the learning rate by a factor of 2-10. Here, the learning rate is lowered by a factor of 0.1 with a patience (number of epochs with no improvement) of 2. Adam is used as the optimizer with a maximum learning rate of 0.0001.
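In PyTorch this is `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=2)`, stepped with the validation loss each epoch. The core behaviour can be sketched as follows (the class name is our own):

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` once the monitored loss has
    not improved for more than `patience` consecutive epochs."""

    def __init__(self, lr=1e-4, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None
        self.bad_epochs = 0

    def step(self, val_loss):
        if self.best is None or val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Two stagnant epochs are tolerated; the third triggers a 10x reduction.
sched = PlateauScheduler(lr=1e-4, factor=0.1, patience=2)
lrs = [sched.step(loss) for loss in (0.50, 0.40, 0.45, 0.45, 0.45)]
```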

Results
In this section, the results are demonstrated separately for the two transfer learning approaches employed in the study. For training each model, all the hyperparameters were finalised through preliminary experiments. The models were trained with a 75/25 split for training and testing, respectively; in other words, the models were trained on a random 75% of the data and tested on the remaining 25%. Five such random splits were used for evaluation. Data augmentation is implemented to increase the effective training set size.
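The evaluation protocol above can be sketched with a plain-Python index split; the function name is our own, and five seeds stand in for the five random splits:

```python
import random

def split_indices(n, train_frac=0.75, seed=0):
    """Shuffle dataset indices and split them into train/test subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

# Five different random 75/25 splits of the 27,000 EuroSAT images.
splits = [split_indices(27000, seed=s) for s in range(5)]
```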

VGG16 -Visual Geometry Group Network
The VGG16 architecture was fine-tuned on the EuroSAT dataset by freezing the pre-trained convolutional layers and training only the added classification layers (Figure 2a) with different hyperparameters. The pre-trained weights provided the advantage of the features already learned on the ImageNet dataset.
While training Without Data Augmentation (WDA), a validation accuracy of 98.14% was achieved, whereas training with data augmentation resulted in a better accuracy of 98.55% (Table 2). The early stopping method was used with a patience of 5, saving the best model with the maximum validation accuracy. This approach helped prevent overfitting and saved computational time. Due to early stopping, the training stopped at the 21st epoch (18th for WDA), where the total number of epochs was 25. Training took 2 h 4 min 12 s for 21 epochs, i.e., approximately 5.9 min per epoch; without data augmentation, it took 1 h 47 min 24 s for 18 epochs, i.e., approximately 6.0 min per epoch (Table 2).

Wide ResNet-50 -Wide Residual Network
In the first approach, training WDA, the model achieved a validation accuracy of 99.04%, which was outperformed by the approach with data augmentation, with an accuracy of 99.17% (Table 2). Hence, the model with the best performance was considered. With early stopping, the training stopped at the 23rd epoch (out of 25 total), whereas the WDA training stopped at the 14th epoch. The best model took 2 h 7 min 53 s to run 23 epochs, i.e., approximately 5.6 min per epoch, which was better than VGG16 (Table 2).
The loss and accuracy graphs show steady learning in the first epoch (Figure 5). Towards the 15th epoch, the learning shows an almost linear relationship, with some instability in between. Furthermore, between the 15th and 23rd epochs, a delayed and small improvement was achieved because of the update of the learning rate to smaller optimum values. The learning rate changed thrice over the entire training period. Figure 6a shows the confusion matrix of VGG16 based on the validation data, which shows the class-wise performance of the model. The Forest, Highway, Residential, and Sea/Lake classes show the best performance, with above 99% accuracy, whereas Permanent Crop, Herbaceous Vegetation, and Pasture show the least accuracy. Annual Crop, Permanent Crop, Pasture, and Herbaceous Vegetation get misclassified because of the similarity in their topological features. By analysing the images of these classes, it is understood that they share common features that might confuse the model.
Figure 6b shows the confusion matrix for Wide ResNet-50. The Forest and Sea/Lake classes are the most accurate, with an accuracy of 99.86%. The Permanent Crop class shows the least accuracy, 97.39%. There is an improvement in accuracy and reduced misclassification for all classes except River. Figure 7 shows some of the correct predictions using Wide ResNet-50, as well as a River scene that is incorrectly predicted as Highway (see the middle panel). The modified VGG16 also incorrectly predicted this River scene as Highway, and it predicted the Permanent Crop scene, shown in the top-middle panel of Figure 7, as Herbaceous Vegetation.

Discussion
In this study, the challenge of LULC classification was addressed using deep transfer learning techniques. For this task, two prominent transfer learning architectures, namely VGG16 and Wide ResNet-50, were used with the EuroSAT dataset. Focusing on the LULC classification of the RGB bands of the EuroSAT dataset, a state-of-the-art accuracy of 99.17% was achieved using Wide ResNet-50.
Experimentally, the best fine-tuning parameters were found for VGG16 and Wide ResNet-50 with the RGB bands of the EuroSAT dataset. The parameters that contributed to the best performance were used to create the final models. The models were compared with and without data augmentation. Both architectures were compared based on their computational training time, the number of epochs trained, and test accuracy (Table 2). From the results, it was observed that the Wide ResNet-50 architecture was computationally more efficient, as the time taken to train each epoch was less than that of VGG16, even though the former is a deeper network.
The number of epochs trained was smaller without data augmentation due to early stopping and limited data: the model converged early without much further improvement, hence consuming a shorter training time. In contrast, more epochs were trained with data augmentation because it generated more data for the model to learn features from, which provided better generalisation and ultimately led to better accuracy. With more high-resolution data, the architecture proposed herein can create and learn from more adversarial examples [53] and make better predictions.
From the confusion matrix shown in Figure 6b, one can see that the Forest class, followed by the Sea/Lake class, performed the best, as it was hardly misclassified. Similarly, due to similar topological features, Herbaceous Vegetation, Annual Crop, Pasture, and Permanent Crop were confused. The Highway class was misclassified as the River class because of a similar linear appearance. A similar trend was observed in the VGG16 confusion matrix (Figure 6a). The presence of clear and distinct topological features for the Forest and Sea/Lake classes, i.e., a majority of green and blue cover, respectively, led to accurate results. Similarly, Pasture, Herbaceous Vegetation, and Annual Crop were misclassified to higher degrees. Again, the Highway and River classifications were confused because of similar topological features. From these trends, it was concluded that the model training mimicked human learning patterns. With more inter- and intra-class variability in the dataset, these faulty learning patterns could be significantly improved. Another effective approach could be incorporating invisible bands, such as near-infrared, into the models to distinguish between road and river [7,39]. From the feature understanding capability depicted by the confusion matrices of both models, the learning patterns of the two architectures were found to be quite comparable; the major difference lay only in how well each model understood the scenes, i.e., the classification accuracy.
In this research, the performance of Wide ResNet-50 and VGG16 was intensively compared on multiple validation datasets. The predictions of Wide ResNet-50 on the EuroSAT dataset were better than those of VGG16 by at least 0.6%. As shown in Table 2, the best-performing Wide ResNet-50 model achieved 99.17%, while VGG16 achieved 98.55%. Thus, Wide ResNet-50 performed better than VGG16. From Table 1, the achieved accuracy of 99.17% using Wide ResNet-50 with the RGB bands is higher than the previous best accuracy of 98.74%, achieved with the DDRL-AM model using the RGB bands.

Conclusions
The objective of this article was to investigate how transfer learning architectures perform for LULC classification. The study was based on two potential architectures, namely VGG16 and Wide ResNet-50, fine-tuned with the RGB bands of the EuroSAT dataset. Much like the findings of other experiments, it was found that transfer learning is a reliable approach that can produce the best overall results. The proposed methodology improved the state of the art and provided a benchmark accuracy of 99.17% for the RGB bands of the EuroSAT dataset.
The classification results before and after data augmentation were compared. Data augmentation techniques increased the diversity of the dataset by increasing the visual variability of each training image, without generating any new spectral or topological information. Evidently, the experimental results with data augmentation outperformed those of the same model architecture trained on the original dataset. Model enhancement techniques such as regularisation, early stopping, gradient clipping, and learning rate optimisation were implemented to make the model training more efficient, improve the performance, and ultimately reduce the computational time required. The Wide ResNet-50 architecture was found to generate better results than VGG16, with the same data augmentation approaches applied to both. Even though Wide ResNet-50 produced better results, the learning patterns of the models resembled each other; the only difference was found in the accuracy of the class predictions.
These misclassifications may be reduced by supplementing the quality and quantity of the data. The generation of datasets with higher inter- and intra-class variability, supported by robust deep learning architectures with data augmentation techniques, could effectively increase the representational power of the deep learning network. Thus, the proposed methodology effectively exploits the available satellite datasets and deep learning approaches to achieve the best performance. The applications can be extended to multiple real-world earth observation applications for remote sensing scene analysis.
Supplementary Materials: Data associated with this research are available online. The EuroSAT dataset is freely available for download [40]. The Jupyter Notebooks used for training the image classifier are available for download at https://github.com/raoofnaushad/EuroSAT_LULC.

Figure 1. EuroSAT dataset sample images. The available classes are Forest, Annual Crop, Highway, Herbaceous Vegetation, Pasture, Residential, River, Industrial, Permanent Crop, and Sea/Lake. The RGB version of the EuroSAT dataset is used for training in this study. The labelled EuroSAT dataset is made publicly available [40]. The dataset is split into a 75/25 ratio for training (20,250 images) and validation (6,750 images), respectively. Mini-batches of 64 images are used for training purposes.

Figure 2. Model architectures: (a) the modified VGG16 architecture with trained and frozen layers, and (b) the modified Wide ResNet-50 architecture with trained and frozen layers.

Model Performance Enhancement Methods

Data Augmentation
The diversity and volume of training data play an eminent role in training a robust DL model.

Figure 3. Early stopping: training is stopped as soon as the validation loss stops decreasing, even though the training loss continues to decrease.

Figure 4 shows the training and validation loss and accuracy diagrams. In the first epoch, both the loss and accuracy improved rapidly, and then showed an almost linear relation from epochs 2 to 10. During this period, some instability in learning was observed, and towards the end, no significant improvement was noticed. Since an adaptive learning rate with ReduceLROnPlateau was used herein, the learning rate was updated thrice during training, which certainly helped the model achieve the optimum result.

Figure 4. The VGG16 results representing the history of the training and validation (a) loss and (b) accuracy across the epochs.

Figure 5. The Wide ResNet-50 results representing the history of the training and validation (a) loss and (b) accuracy across the epochs.

Figure 6. The confusion matrices for the (a) VGG16 and (b) Wide ResNet-50 architectures applied to the EuroSAT dataset.

Figure 7. The Wide ResNet-50 sample results, showing the actual and predicted labels for sample inputs from the test dataset. Note that VGG16 predicts these scenes the same as Wide ResNet-50, except that it incorrectly predicts the top-middle scene as Herbaceous Vegetation.

Table 1. Comparative analysis of studies for LULC classification with the EuroSAT dataset.

Table 2. Comparative experimental results of VGG16 and Wide ResNet-50 with and without data augmentation.