Article

MAAFEU-Net: A Novel Land Use Classification Model Based on Mixed Attention Module and Adjustable Feature Enhancement Layer in Remote Sensing Images

1 School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
4 School of Atmospheric Science and Remote Sensing, Wuxi University, Wuxi 214105, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2023, 12(5), 206; https://doi.org/10.3390/ijgi12050206
Submission received: 24 February 2023 / Revised: 16 May 2023 / Accepted: 19 May 2023 / Published: 20 May 2023

Abstract

The classification of land use information is important for land resource management. With the purpose of extracting precise spatial information, we present a novel land use classification model based on a mixed attention module and adjustable feature enhancement layer (MAAFEU-net). Our unique design, the mixed attention module, allows the model to concentrate on target-specific discriminative features and capture class-related features within different land use types. In addition, an adjustable feature enhancement layer is proposed to further enhance the classification ability of similar types. We assess the performance of this model using the publicly available GID dataset and the self-built Gwadar dataset. Six semantic segmentation deep networks are used for comparison. The experimental results show that the F1 score of MAAFEU-net is 2.16% and 2.3% higher than that of the next best model and that the MIoU is 3.15% and 3.62% higher than that of the next best model. The results of the ablation experiments show that the mixed attention module improves the MIoU by 5.83% and that the addition of the adjustable feature enhancement layer further improves it by 5.58%. Both structures effectively improve the accuracy of the overall land use classification. The validation results show that MAAFEU-net can obtain land use classification images with high precision.

1. Introduction

Land use information is the basis for understanding dynamic changes and socio-ecological interactions on the Earth’s surface [1]. It is an indispensable presence in many areas of Earth-based observation, such as urban and regional planning [2,3], environmental vulnerability and impact assessment [4,5,6], and natural disaster and hazard monitoring [7,8,9,10,11].
Traditional land use classification methods include land mapping and remote sensing mapping. Although land mapping is accurate and can combine information at different scales, mapping large areas requires a great deal of manpower, time and money [12], and it is gradually being replaced by fast-developing remote sensing mapping technology. Remote sensing images can portray the general situation of the study area cheaply and conveniently. High-resolution remote sensing images can clearly display the spatial structure and surface texture of ground objects, distinguish their internal composition in more detail and delineate their edges more sharply, which provides a basis for effective land use classification. However, the accuracy of remote sensing mapping is low because it mainly relies on visual discrimination and is easily affected by subjective judgment. Zomer et al. [13] combined land mapping and remote sensing mapping to establish an effective spectral library of wetlands by analyzing spectral maps obtained in the field together with spectral data acquired by remote sensing, which helps in the mapping and conservation of wetlands.
The rapid improvement in computer technology and remote sensing image resolution has led to the application of machine learning methods such as artificial neural networks (ANN), support vector machines (SVM), cluster analysis, Markov chains and fuzzy adaptive resonance theory-supervised predictive mapping to land classification. In recent years, a lot of research has been conducted using machine learning methods in land use and land cover classification. Talukdar et al. [14] used six machine learning algorithms, including random forest (RF), SVM, ANN, fuzzy adaptive resonance theory-supervised predictive mapping and spectral angle mapper. Abdi et al. [15] used Sentinel-2 images and combined classical machine learning and deep learning methods in a boreal landscape to obtain an accurate land use classification. Wang et al. [16] compared machine learning algorithms with cellular automata (CA) and other methods, and reported that machine learning methods still have development potential in land use classification. Zhang et al. [17] predicted future ground cover and surface temperature changes in Wuhan using CA and ANN. Hao et al. [18] conducted a year-by-year mapping of land cover in Mongolia for 1990–2021 based on Landsat TM/OLI and RF methods to analyze land use change patterns and their drivers. Carlier et al. [19] used a polythetic hierarchical agglomerative cluster analysis to depict a landscape classification map of the Republic of Ireland. Kaczmarek et al. [20] used ANN for textual analysis of land use planning to better integrate spatial development planning. Zhang et al. [21] used ANN-CA and a long short-term memory model with an improved whale optimization algorithm (IWOA-LSTM) to predict land use and surface temperature changes in Wuhan, China. Rahnama et al. [22] used CA and the Markov chain to predict land use changes from 2016 to 2030 using Sentinel-2 satellite images of the Mashhad metropolitan area. Sobhani et al. [23] monitored land use changes in two protected areas of Tehran, Iran, using CA and the Markov chain with Landsat images from 1989–2019.
Although classical machine learning models can reduce costs to a certain extent, the results are too vague and rough compared to land mapping and do not meet today's requirements for high-precision data. With the gradual development of artificial intelligence technology, semantic segmentation techniques have gradually replaced previous machine learning methods as the mainstream approach to land use classification. Image semantic segmentation is the process of subdividing a digital image into multiple superpixels. The purpose of image semantic segmentation is to simplify or change the representation of the image and make it simpler and easier to understand and analyze [24]. Image semantic segmentation can be divided into traditional semantic segmentation and semantic segmentation by deep networks. Traditional semantic segmentation techniques include threshold-based, edge-based, region-based and theory-specific image semantic segmentation [25]. For example, Yu et al. [26] used normalized cut and color-based segmentation to produce an automatic image captioning system. Wang et al. [27] improved GrabCut for isolated lung nodule segmentation in X-ray images.
Traditional semantic segmentation methods are based on low-level semantic information for image processing; they can only utilize shallow feature information, lack contextual links and information interaction between pixels, and cannot take all land types into account for overall classification. Therefore, a land use classification model should be able to extract high-level semantic information and use both spatial and global features of images. In recent years, deep network semantic segmentation applications for land use classification have emerged with greater learning capability and more comprehensive feature extraction. They mainly include SegNet [28], which uses a pooled indexing method to save memory space and allows larger images to be input for segmentation; the pyramid scene parsing network (PSPNet) [29], which proposes a spatial pyramid pooling (SPP) structure to increase the receptive field, allowing the network to better integrate global information; U-net [30], whose U-shaped structure with skip connections can make good use of the semantic information at different levels of the image; Deeplabv3+ [31], which uses an encoder-decoder structure based on the atrous spatial pyramid pooling (ASPP) structure, strengthening feature representation, while its retention of a large amount of shallow information makes it better at identifying the edges of segmented objects; MANet [32], which contains the new kernel attention module (KAM) and has been shown through ablation experiments to be as effective as other CNN attention methods; and DC-Swin [33], which uses the Swin Transformer as the encoder and a densely connected feature aggregation module (DCFAM) as the decoder to extract multi-scale relationally enhanced semantic features for accurate segmentation. Yuan et al. [34] achieved good results in land use classification of remote sensing images based on PSPNet. Zhan et al. [35] improved Deeplabv3+ to increase land use classification accuracy. In addition to deep networks composed of convolutional neural networks (CNN), the transformer, originally developed for natural language processing (NLP), has also been applied to land classification. This type of network uses a multi-head attention mechanism as its main construct, allowing the model to focus on information from different locations and ultimately obtain richer features of the ground object types. For example, Hu et al. [36] used a parallel transformer structure that outperformed other comparative models on hyperspectral and radar data. Li et al. [37] proposed a CNN-transformer combination for crop classification that achieved significant performance improvements compared to other models.
Currently, many studies have demonstrated the advantages of deep networks over traditional land classification methods. Sun et al. [38] combined multi-filtering and multi-resolution segmentation to design a network for experiments on radar data and high-resolution aerial imagery; the results showed that deep networks do improve land classification results. Wang et al. [39] used 0.5 m resolution aerial remote sensing images of Zhejiang province to build a dataset, designed a multi-scale CNN and compared it with the traditional SVM method. The experimental results showed that the multi-scale network has higher accuracy and a better overall classification effect than the traditional method, demonstrating the advantages of deep networks. Shi et al. [40] used a deep neural network combining supervised and unsupervised methods to monitor land use and environmental changes; the experimental results on multi-temporal images showed that the method could effectively and accurately detect and analyze changes. Although deep networks are used in land use classification, few studies apply transformers to this task. CNN models cannot combine overall and local information well, and cannot adequately connect and distinguish the characteristics of different land use types, resulting in overly smooth segmentation results and insufficient improvement of overall classification accuracy.
To address the above issues, this paper proposes a novel land use classification model with a mixed attention module and adjustable feature enhancement layer. The model first reduces feature loss by incorporating the Focus structure into ResNet50 [41] feature extraction. Afterward, a convolutional attention mechanism and a transformer are combined to effectively obtain the contextual relationships between pixels and improve the recognition of different feature types. Finally, in the upsampling stage, our model uses a feature enhancement layer to strengthen feature recognition and learns to assign the proportion of different feature maps to the segmentation results, reducing the impact of similar features on the classification and improving the accuracy of the overall land use classification. The main contributions of this paper can be summarized in the following four points.
1. In this paper, we design a mixed attention (MA) module that combines the convolutional block attention module (CBAM) and Swin Transformer blocks, which enables the model to extract features of different land use types in the image in a more targeted manner, greatly improves the quality of the subsequent upsampling and feature enhancement and effectively increases the accuracy of land use classification.
2. This paper proposes a feature enhancement layer that uses adjustable parameters and multi-branch convolution on the input feature maps to enhance the features of land classification types and alleviate the problem of incorrect segmentation caused by similar features of similar types.
3. A Gwadar dataset for land use classification is built in this paper and can be viewed and downloaded from: https://gas.graviti.com/dataset/xvxibu/GwadarDataset (accessed on 7 December 2022).
4. The model in this paper achieves the best performance on both the GID dataset and the Gwadar dataset compared to other segmentation networks. The F1 score is 76.38% and the MIoU is 61.80% on the GID dataset, and the F1 score is 93.69% and the MIoU is 88.50% on the Gwadar dataset.
The rest of the paper is structured as follows. The overall structure of the model and details of the modules are presented in Section 2. The datasets and experiment settings are described in Section 3, the experimental results are presented in Section 4 and, finally, a summary is given in Section 5.

2. Methods

2.1. General Framework of the Model

In this paper, two different attention mechanism modules, the Swin Transformer [42] and CBAM [43], are used to build the MA module, which fully exploits the distinct advantages of the two modules for feature extraction and strengthens the connection between global and local features. As shown in Figure 1, MAAFEU-net consists of three main components: the backbone network with the addition of Focus, the MA module to enhance feature extraction and the feature enhancement layer to further process the upsampled feature maps. Firstly, because the initial downsampling of the original image in the ResNet50 backbone loses a relatively large amount of feature information, Focus is used instead of the first convolution operation to reduce this loss. Secondly, due to the complex land use types and irregular spatial distribution in remote sensing images, MAAFEU-net incorporates the MA module to improve the feature extraction capability. The MA module uses CBAM to add overall feature information to the local feature map extracted by convolution and the Swin Transformer to obtain the connections between different types of ground objects. Combined with the cross stage partial layer (CSPlayer), it reduces the possibility of duplication in the information integration process and improves the quality of the feature map from multiple perspectives for subsequent upsampling and feature enhancement. Finally, in order to reduce the influence of similar features on the classification, the upsampled image and the feature map from Focus are fed into the feature enhancement layer, and the scale of the input feature maps is adjusted by trainable parameters. Then, the features of different types of ground objects are further enhanced by depthwise separable convolution and adaptive pooling, and the segmentation results are output.
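The data flow described above can be summarized with a short schematic sketch. The following PyTorch skeleton is only an illustration of how the three components might be wired together; the component modules are placeholders, and the exact skip-connection layout follows Figure 1 rather than this simplified chain.

```python
import torch.nn as nn

class MAAFEUNetSketch(nn.Module):
    """Schematic data flow of the pipeline in Figure 1 (illustrative only)."""
    def __init__(self, focus, backbone, ma_module, decoder, enhancement_layer):
        super().__init__()
        self.focus = focus                  # Focus slicing replaces the first downsampling conv
        self.backbone = backbone            # remaining ResNet50 stages
        self.ma_module = ma_module          # CBAM + Swin Transformer branches fused by a CSPlayer
        self.decoder = decoder              # upsampling path
        self.enhancement_layer = enhancement_layer  # adjustable feature enhancement (Section 2.2.3)

    def forward(self, x):
        y1 = self.focus(x)                         # shallow feature map Y1
        deep = self.ma_module(self.backbone(y1))   # deep features refined by mixed attention
        y2 = self.decoder(deep)                    # upsampled deep feature map Y2
        return self.enhancement_layer(y1, y2)      # weighted fusion and enhancement
```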

2.2. Main Components of the Model

2.2.1. Focus

The Focus structure was first used in Yolov5 [44] to slice the image before it enters the backbone network. The input to the Focus structure is a 3-channel RGB image of size H × W. On each channel, a value for every other pixel is taken along the rows and columns, respectively, giving a total of four independent feature layers per channel that are complementary and lose almost no information. The original image size becomes H/2 × W/2 and the number of channels changes from 3 to 12, after which a 3 × 3 convolution operation changes the number of channels to 64. Compared to ResNet50, which uses a direct 7 × 7 convolution kernel for the initial downsampling, Focus reduces the loss of feature information during the downsampling operation, facilitating further feature extraction by the MA module later.
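As a concrete illustration, the following is a minimal PyTorch sketch of the slicing described above (not the authors' implementation); the channel progression 3 → 12 → 64 follows the text, while the use of batch normalization and ReLU after the 3 × 3 convolution is an assumption.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice an H x W x 3 image into 4 complementary sub-images per channel
    (3 -> 12 channels, spatial size halved), then fuse with a 3 x 3 convolution."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels * 4, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),   # normalization/activation are assumptions
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Take every other pixel along rows and columns: four offset grids per channel.
        x = torch.cat([x[..., ::2, ::2],    # even rows, even cols
                       x[..., 1::2, ::2],   # odd rows, even cols
                       x[..., ::2, 1::2],   # even rows, odd cols
                       x[..., 1::2, 1::2]], # odd rows, odd cols
                      dim=1)
        return self.conv(x)

# Example: a 3-channel 512 x 512 image becomes a 64-channel 256 x 256 feature map.
if __name__ == "__main__":
    y = Focus()(torch.randn(1, 3, 512, 512))
    print(y.shape)  # torch.Size([1, 64, 256, 256])
```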

2.2.2. Mixed Attention Module

This paper uses the Swin Transformer as a branch of the MA module because it has the transformer's ability to explore the connections between the whole image and local features, and it can also reduce the computational effort through shifted window attention, which effectively alleviates the large computational requirements of a transformer. On the other hand, the attention module CBAM combines the powerful feature representation capability of CNNs with the attention effect and is able to focus on important feature regions in the channel and spatial dimensions, which enhances and complements the results obtained by the Swin Transformer. To avoid either method interfering with the features and correlation information extracted by the other, the final MA module combines the two attention modules in parallel and uses the CSPlayer to integrate the different attention information.

Swin Transformer Blocks

To enhance the ability of the model to capture the connections between different land use types in the feature map, Swin Transformer blocks are used as branches of the feature extraction module MA. As shown in Figure 2a, the input feature map in the first block is chunked into embedded patches of size 1 and then flattened along the channel direction. This is followed by a layernorm operation before entering the window multi-head self-attention (W-MSA) layer (Figure 2b), where the W-MSA splits the feature map into multiple non-overlapping windows and then performs a self-attention calculation inside each window. If the feature map size is given as X ∈ ℝ^(H×W×C) and the partition window size is M × M, then the standard (Equation (1)) and windowed (Equation (2)) multi-head self-attention computational complexities are
Ω(MSA) = 4HWC² + 2(HW)²C (1)
Ω(W-MSA) = 4HWC² + 2M²HWC (2)
As the window size is much smaller than the image, this reduces the amount of computation and speeds up the operation. After the fully connected layer and the dropout, the feature map goes into the second block. The only difference compared to the previous block is that the W-MSA is replaced by the shifted window multi-head self-attention (SW-MSA) (Figure 2c), which shifts the segmented windows to create a new 9-window block. In order to maintain the computational load of the original 4 windows while still being able to interact with the information between the different windows, the A, B and C windows are shifted and masked multi-head self-attention (Masked MSA) is performed, which can avoid feature extraction errors caused by the mixing of features in unrelated regions. The layernorm and dropout layers help the network to converge better and prevent the network from overfitting.
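To make the windowing concrete, below is a minimal sketch of the window partition used by W-MSA and the cyclic shift behind SW-MSA. It follows the same idea as the public Swin Transformer reference code rather than the authors' implementation; the 8 × 8 map and window size of 4 in the example are arbitrary.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of size M x M,
    so self-attention can be computed independently inside each window (W-MSA)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def cyclic_shift(x, window_size):
    """SW-MSA rolls the map by half a window so pixels near old window borders share
    a new window; a mask then blocks attention between originally unrelated regions."""
    shift = window_size // 2
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# Example: an 8 x 8 map with M = 4 yields four windows of 16 tokens each.
windows = window_partition(torch.randn(1, 8, 8, 96), window_size=4)
print(windows.shape)  # torch.Size([4, 16, 96])
```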

Convolutional Block Attention Module (CBAM)

CBAM is a simple and effective convolutional attention module for feed-forward neural networks. It consists of two main components: the channel attention module (CAM) and the spatial attention module (SAM), as shown in Figure 3a. The model in this paper combines local and global feature information by incorporating the CBAM module to improve sensitivity to different types of ground objects and reduce the impact of larger-scale types on smaller land use types.
CAM is used to focus on the useful information in the feature map and fuse this information with each other, as shown in Figure 3b. The input feature map F is passed through parallel max pooling and average pooling layers, shared MLP module, summation operation and sigmoid activation function to obtain the channel attention Mc(F). Feature map F’ with the channel attention attached is obtained by multiplying F with Mc(F) at the element level. The process of channel attention calculation is shown in Equations (3) and (4):
Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (3)
F′ = Mc(F) × F (4)
With SAM, the local feature map can combine the semantic information of the context to improve the quality of the output features and facilitate the extraction of spatial location information of different types of ground objects, as shown in Figure 3c. After max pooling and average pooling of the input feature map F′, SAM obtains two 1-channel feature maps of the same size and performs a concatenation operation on the two results. Then, the feature map is transformed into a 1-channel feature map by 7 × 7 convolution, and the combined spatial attention Ms(F′) is obtained by the sigmoid activation function. Finally, F’ is multiplied by the corresponding pixels of Ms(F′) to obtain the feature map F″, incorporating channel and spatial attention. The calculation process is shown in Equations (5) and (6):
Ms(F′) = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)])) (5)
F″ = Ms(F′) × F′ (6)
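The following compact PyTorch module is an illustrative re-implementation of Equations (3)–(6), not the code used in this paper; the channel reduction ratio of 16 inside the shared MLP is an assumption carried over from the original CBAM design.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (Eqs. (3)-(4)) followed by spatial attention (Eqs. (5)-(6))."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        # Shared MLP applied to both the average-pooled and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # 7 x 7 convolution over the two concatenated spatial descriptors.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        # Channel attention: Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        mc = torch.sigmoid(self.mlp(f.mean(dim=(2, 3), keepdim=True)) +
                           self.mlp(f.amax(dim=(2, 3), keepdim=True)))
        f1 = mc * f  # F' = Mc(F) x F (element-wise)
        # Spatial attention: Ms(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')]))
        ms = torch.sigmoid(self.spatial(torch.cat(
            [f1.mean(dim=1, keepdim=True), f1.amax(dim=1, keepdim=True)], dim=1)))
        return ms * f1  # F'' = Ms(F') x F'

# Example: the attention-refined feature map keeps the input shape.
out = CBAM(64)(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```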

Cross Stage Partial Layer

The CSPlayer is derived from CSPNet [45]. It is used in this paper to further process the high-dimensional feature maps output by the Swin Transformer blocks and CBAM and to deepen the features. In addition, the CSPlayer effectively reduces the possibility of repetition during information integration and improves the learning capability of the network. The bottleneck is a residual structure consisting of a 1 × 1, 3 × 3 and 1 × 1 convolution sequence, and the backbone of the CSPlayer is made up of multiple stacked bottlenecks. The other part of the CSPlayer is processed by 1 × 1 convolutions and batch normalization, concatenated along the channel dimension with the features extracted from the backbone, and the result is then output.
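A minimal sketch of this split-and-merge layout is given below; the depth of two stacked bottlenecks and the SiLU activation are assumptions (the paper does not state them), and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1):
    """Convolution + batch normalization + activation (activation is assumed)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class Bottleneck(nn.Module):
    """Residual 1x1 -> 3x3 -> 1x1 convolution sequence, as described in the text."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(conv_bn_act(c, c, 1), conv_bn_act(c, c, 3), conv_bn_act(c, c, 1))

    def forward(self, x):
        return x + self.block(x)

class CSPLayer(nn.Module):
    """One branch stacks bottlenecks, the other is a plain 1x1 projection;
    the two are concatenated along the channel dimension and fused."""
    def __init__(self, c_in, c_out, n_bottlenecks=2):  # depth is an assumption
        super().__init__()
        c_hidden = c_out // 2
        self.main = nn.Sequential(conv_bn_act(c_in, c_hidden),
                                  *[Bottleneck(c_hidden) for _ in range(n_bottlenecks)])
        self.shortcut = conv_bn_act(c_in, c_hidden)
        self.fuse = conv_bn_act(2 * c_hidden, c_out)

    def forward(self, x):
        return self.fuse(torch.cat([self.main(x), self.shortcut(x)], dim=1))

print(CSPLayer(128, 128)(torch.randn(1, 128, 16, 16)).shape)  # torch.Size([1, 128, 16, 16])
```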

2.2.3. Adjustable Feature Enhancement Layer

Due to the high similarity between some types in land use classification, the features of the segmented objects must be further enhanced to obtain better land use classification results. First, the final Y2 obtained from the multiple upsampling steps is subjected to a 1 × 1 convolution operation to change the number of channels from 128 to 64 so that the Y1 obtained from the Focus structure and Y2 have the same dimension. Afterward, Y1 and Y2 are multiplied by the parameters X1 and X2, respectively. X1 and X2 are calculated using Equations (7) and (8), where a1 and a2 are parameters generated using PyTorch's parameter function with an initial value of 1. They are trained by the model so that the calculated X1 and X2 can be continuously adjusted. b is a constant of 1 × 10⁻⁸ that avoids errors when the divisor is zero:
X1 = a1 / (a1 + a2 + b) (7)
X2 = a2 / (a1 + a2 + b) (8)
The adjustable parameters X1 and X2, ranging from 0 to 1, allow the model to better allocate, within the segmentation results, the proportions of the shallow feature map Y1, where little information is lost but deep details are not mined, and the deep feature map Y2 obtained by upsampling.
After the two feature maps are added, a 3 × 3 convolution is performed to integrate the features. The fused feature map passes through a 1 × 1 convolution and a 3 × 3 depthwise separable convolution before adaptive average pooling, and the results are summed to enhance the representation of distinct features. After that, the results of the 3 × 3 depthwise separable convolution and the 1 × 1 convolution are combined to further fuse the features. The final output segmentation result is shown in Figure 4. The depthwise separable convolution has fewer parameters and less computational effort than normal convolution, and it is faster. Adaptive average pooling allows the parameters to be calculated automatically from the input and output dimensions, without setting the convolution kernel and step size. The pooling operation can effectively emphasize the obvious features, and the small features it loses are compensated to a certain extent by adding the results of the depthwise separable convolution. Lastly, the summation with the results of the ordinary convolution both enhances the features and complements the feature information discarded by pooling. The number of channels in the feature map is 64 throughout the structure, and the final 1 × 1 convolution outputs a segmentation result with 3 channels. The whole structure is designed to enable the model to adjust the proportion of the impact of downsampling and upsampling on the segmentation map, combining the information extracted from deep and shallow features, while the different convolution operations of the three branches fuse and strengthen the features of the segmented types, further improving the accuracy of the resulting feature map.
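A minimal PyTorch sketch of the adjustable weighting in Equations (7) and (8) is shown below; the three-branch enhancement that follows is simplified relative to Figure 4 (the pooling branch is reduced to a global-context term), so it should be read as an illustration of the idea rather than the exact layer wiring.

```python
import torch
import torch.nn as nn

class FeatureEnhancementLayer(nn.Module):
    """Adjustable fusion of the shallow map Y1 and the upsampled map Y2 (Eqs. (7)-(8)),
    followed by a simplified multi-branch enhancement and a 3-channel output head."""
    def __init__(self, channels=64, out_channels=3, eps=1e-8):
        super().__init__()
        self.a1 = nn.Parameter(torch.ones(1))      # trainable weight for Y1
        self.a2 = nn.Parameter(torch.ones(1))      # trainable weight for Y2
        self.eps = eps                             # the constant b in Eqs. (7)-(8)
        self.reduce = nn.Conv2d(128, channels, 1)  # 128 -> 64 channels for Y2
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        # Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1.
        self.depthwise = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)        # pooling branch, simplified to global context
        self.head = nn.Conv2d(channels, out_channels, 1)

    def forward(self, y1, y2):
        y2 = self.reduce(y2)                       # match Y1's 64 channels
        denom = self.a1 + self.a2 + self.eps
        x = self.fuse((self.a1 / denom) * y1 + (self.a2 / denom) * y2)
        context = self.pool(self.pointwise(x))     # pooled salient features, broadcast-added below
        x = context + self.depthwise(x) + self.pointwise(x)
        return self.head(x)

# Example: Y1 (64 channels) and Y2 (128 channels) of the same spatial size.
out = FeatureEnhancementLayer()(torch.randn(1, 64, 256, 256), torch.randn(1, 128, 256, 256))
print(out.shape)  # torch.Size([1, 3, 256, 256])
```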

3. Dataset Descriptions and Experiment Settings

This paper uses the self-built Gwadar and the publicly available GID datasets to validate the MAAFEU-net from different aspects. This section details the composition of the datasets with classified land use types, shows sample examples and compares the two datasets.

3.1. Gwadar Dataset

In order to test the performance of the improved model, a semantic segmentation dataset for land use classification of the Gwadar Port in Pakistan was created, as many developing countries along the Belt and Road need to monitor changes in land use types in a timely manner; the study area was chosen to assist in the development of local infrastructure. As the Gwadar Port protrudes from the mainland and is surrounded on three sides by the sea, the main areas of construction and human activity are in the middle of this region. Therefore, GF-2 satellite images of the area from March 2021 are cropped. The ROI tool of the ENVI software is used to select areas containing mainly the land use classes required for the experiment, discarding large areas of the surrounding ocean as well as visually identified surrounding hills and bare ground that lack the required categories, resulting in an image size of 4663 × 4679 pixels. By combining the Google Maps image with a priori knowledge of the morphology and color of the different land use types, the dataset was segmented and labeled on the original map using the Labelme image annotation tool. The land use types in the image considered in this paper are building, plant, water, port, road and background. In order to facilitate the subsequent training and testing of the model, the remote sensing image needs to be pre-processed with data cropping and data enhancement in the following steps: (1) the image is randomly cropped into 512 × 512 pixel patches, (2) the obtained images are processed with data enhancement such as rotation and flipping, (3) the obtained images are divided into datasets.
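As an illustration of steps (1) and (2), a minimal sketch of a paired crop-and-augment routine is given below (an assumption about the implementation, not the authors' code); the key point is that the image and its label mask must receive identical random transforms.

```python
import random
import numpy as np

def random_crop_and_augment(image, label, size=512):
    """Crop a size x size patch from a large scene and apply the same random
    rotation/flip to the image (H, W, 3) and its label mask (H, W)."""
    h, w = label.shape[:2]
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    img = image[top:top + size, left:left + size]
    lab = label[top:top + size, left:left + size]
    k = random.randint(0, 3)                      # rotate by 0/90/180/270 degrees
    img, lab = np.rot90(img, k), np.rot90(lab, k)
    if random.random() < 0.5:                     # random horizontal flip
        img, lab = np.fliplr(img), np.fliplr(lab)
    return img.copy(), lab.copy()
```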
Finally, 4520 sample images of size 512 × 512 pixels were obtained, of which 80% were used as the training set and 20% were used for validation and testing. Table 1 shows the land types and color annotations in the dataset, and Table 2 gives the area and proportion of each land use type in the dataset.

3.2. GID Dataset

The GID dataset [46] is a high-resolution remote sensing image dataset collected by Wuhan University for land use classification from December 2014 to October 2016. It includes 150 high-resolution GF-2 images from more than 60 cities in China, covering a geographical area of over 50,000 km2. Each remote sensing image in the dataset is 6800 × 7200 pixels, with a spatial resolution of 1 m and consisting of three RGB channels. The six land types considered are building, farmland, water, forest, meadow and background. In this paper, all 150 images are cropped to 512 × 512 pixels to build the dataset, resulting in 8125 cropped images—similar to the Gwadar dataset—with 80% used for training and 20% for validation and testing. The cropped dataset with the class annotation is shown in Table 3.

3.3. Comparison between Gwadar and GID Datasets Images

The experimental results on the Gwadar and GID datasets differ due to the diversity of data sources and the land use types for which the classification is carried out. As shown in Figure 5a–c, due to the specificity of the study area in the Gwadar dataset, the types of ground objects are more distinct and few features similar to the experimental types appear in the background. Hence, the overall land use classification is less difficult for the model, but difficult-to-identify types of ground objects such as vegetation and roads, which cover a wide area with large variation in features (Figure 5d–f), still pose a challenge to the feature extraction capability of the model. The wide range of areas makes it possible for the features of the same land use type to vary between regions, increasing the difficulty of classification for the model. In addition, among the experimental types of ground objects, forests and meadows are more likely than other types to be surrounded by features with similar characteristics, increasing the probability of model classification errors.

3.4. Experiment Settings

The improved and comparison models are trained on NVIDIA GeForce GTX 1080Ti GPUs, and all experiments are conducted in the PyTorch framework. The models are optimized using a stochastic gradient descent (SGD) optimizer. Due to equipment limitations, the batch size used in the experiments is four. The loss function is the cross-entropy loss. The initial learning rate is set to 2.5 × 10⁻³, and the learning rate is updated by the step method using the following equation:
lr = lr0 × pⁿ (9)
where lr is the current learning rate, lr0 is the initial learning rate, set to 0.01, p is the decay ratio, set to 0.774 in the experiments, and n is obtained by dividing the number of current iterations by the learning rate update step, with the update step set to 20 for all training.
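Under the assumption that n counts completed 20-epoch steps, the schedule above corresponds to PyTorch's built-in StepLR scheduler, as in the sketch below; the momentum value and placeholder model are illustrative only, and the initial rate follows the 2.5 × 10⁻³ stated above.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Conv2d(3, 6, kernel_size=1)                   # placeholder model
optimizer = SGD(model.parameters(), lr=2.5e-3, momentum=0.9)   # momentum is assumed
# lr = lr0 * p^n with p = 0.774 and n increasing by 1 every 20 epochs (the update step).
scheduler = StepLR(optimizer, step_size=20, gamma=0.774)

for epoch in range(100):
    # ... run one training epoch (forward, loss, optimizer.step()) here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```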

3.5. Accuracy Evaluation

The model test results are evaluated using both qualitative and quantitative criteria. The qualitative evaluation is based on subjective judgments, such as the completeness of the result map compared to the labels and the distinctness of the edge contours of the classified objects. The quantitative evaluation uses numerical values as objective criteria. In this paper, the IoU, F1 score, overall accuracy and mean intersection over union (MIoU) are used to compare each type. TP is a sample in which the actual positive class is predicted to be positive. FN is a sample in which the actual positive class is predicted to be negative. TN is a sample in which the actual negative class is predicted to be negative, and FP is a sample in which the actual negative class is predicted to be positive.
The recall is the ratio of the number of correctly classified positive class samples to the number of all positive class samples, and the formula can be expressed as
Recall = TP / (TP + FN) (10)
The precision is the ratio of the number of correctly classified positive class samples to the number of positive class samples in the classification result (Equation (11)):
Precision = TP / (TP + FP) (11)
Precision or recall alone can sometimes be contradictory, so the F1 score is employed to consider precision and recall together, and the formula can be expressed as
F1 score = 2 × Recall × Precision / (Recall + Precision) (12)
OA is the ratio of the number of correctly classified samples to the number of all samples and the formula can be expressed as
OA = (TP + TN) / (TP + TN + FN + FP) (13)
MIoU is the average of the intersection ratio of true labels to predicted values for each class, enabling a global evaluation of the results of semantic segmentation. The formula can be expressed as follows:
MIoU = (1 / (k + 1)) Σ_{i=0}^{k} TP / (FN + FP + TP) (14)
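For reference, the following sketch computes these metrics (Equations (10)–(14)) from a per-class confusion matrix; it is a generic illustration rather than the evaluation code used in this paper.

```python
import numpy as np

def segmentation_metrics(conf):
    """Per-class recall, precision, F1 and IoU plus OA and MIoU from a
    (k+1) x (k+1) confusion matrix where conf[i, j] counts pixels of true
    class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fn = conf.sum(axis=1) - tp            # class-i pixels predicted as another class
    fp = conf.sum(axis=0) - tp            # other pixels predicted as class i
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    iou = tp / (tp + fp + fn)
    oa = tp.sum() / conf.sum()            # overall accuracy: correct pixels / all pixels
    return recall, precision, f1, iou, oa, iou.mean()   # iou.mean() is the MIoU

# Example with a small 3-class confusion matrix.
conf = np.array([[50, 2, 3], [4, 40, 1], [2, 2, 46]])
print(segmentation_metrics(conf)[-2:])    # OA and MIoU
```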

4. Experiment Results

To better evaluate the performance of the improved model, six technically mature deep networks are used for comparison: MANet, DC-Swin, VGG16-based U-net, Deeplabv3+, SegNet and PSPNet, all trained and tested in the same environment with the same configuration. No pre-trained weights are used for any model; all are trained from scratch until convergence before stopping. Seven ablation experiments are also conducted to verify the effectiveness of the improved modules for different land use types.

4.1. Results of Gwadar Dataset

Table 4 gives the land classification accuracy for the selected types of ground objects and the combined evaluation metrics for the seven different networks. For the single area types—building, water and background—which occupy a large area, the classification difficulty is low and MAAFEU-net has the highest accuracy but not a significant advantage. For the plant type, both U-net and MAAFEU-net can effectively use feature maps of different dimensions, so they are able to identify plants with large variations in size and sparse distribution in Gwadar. At the same time, MAAFEU-net can further extract features through the MA module to enhance the classification ability, which is more powerful than U-net. The port type is closely related to water and, as the MA module of MAAFEU-net can mine the links between different types well, it is easier for the MA module to capture the relationship between port and water and extract the corresponding type features in a targeted manner compared to the ASPP structure of Deeplabv3+ and the SPP structure of PSPNet. The road is not very recognizable in the Gwadar Port area and is highly susceptible to confusion with the irrelevant background type. Although DC-Swin uses the Swin Transformer as the feature extraction network to obtain the key features of the road category, it lacks the effective extraction of secondary and edge features by CNN, which leads to its poor recognition of slender roads. Compared with other models, MAAFEU-net is able to enhance the information representation of the road type by reducing the influence of background pixels with similar features to the road through the feature enhancement layer and improving the classification accuracy for the road type. MAAFEU-net achieves the highest performance in all types of land use classification, with an MIoU of 88.50%.
Figure 6 shows the prediction results of the different networks for the Gwadar dataset images. The segmentation results of MAAFEU-net for road and plant in Figure 6a–e are significantly better than the other models, with continuous and clear boundaries due to the ability of MAAFEU-net to reduce the effect of similar features of background and road and plant using the feature enhancement layer. The segmentation map of MAAFEU-net for the port type in Figure 6b has smoother edges compared to the other models, while two smaller ports in the lower right are also clearly segmented in Figure 6d, which indicates that MAAFEU-net captures the connection between port and water and improves the classification accuracy of edge-related pixels. The results show that MAAFEU-net has excellent land classification results for remote sensing images with high pixel complexity and strong type relevance.

4.2. Results of GID Dataset

Table 5 gives the segmentation accuracy and the overall evaluation indices for the different land use types. The table shows that the improved MAAFEU-net obtains a higher accuracy rate than the other models. Although SegNet and U-net also use an encoding-decoding structure, with no further processing of deeper features and a lack of consideration of the correlation between different types, SegNet has lower accuracy for building, forest and meadow recognition, while the multi-level skip connections help U-net compensate for the loss of information from downsampling. MANet and DC-Swin utilize convolutional attention and transformer attention, respectively. However, the former lacks the transformer's ability to mine the relationships between different categories, while the latter relies only on the Swin Transformer for feature extraction, losing the CNN's ability to efficiently characterize image features; therefore, both perform poorly. Compared to the SPP structure of PSPNet, which can enhance the recognition of targets of different sizes to a certain extent, the ASPP structure of Deeplabv3+ further increases the receptive field, resulting in higher overall classification accuracy, which is effective for types with large differences in size, such as forest; because of this, the IoU of Deeplabv3+ for this type is 5.46% higher than that of PSPNet. In contrast, MAAFEU-net is able to extract more correlated pixel features through the MA module in the classification of different land types and, for similar features, can reduce the misclassification of image elements through the processing of the feature enhancement layer, giving a stronger land use classification capability and a 3.15% improvement in MIoU compared to Deeplabv3+. Forest and meadow types are more likely than the other experimental types to have background pixels with similar features interfering with model classification. However, MAAFEU-net is able to rely on the feature enhancement layer to reduce this effect, achieving 54.76% classification accuracy for forest, while its meadow accuracy (49.53%) is slightly lower than that of U-net (51.61%); as seen in the ablation experiments below, this is because the CSPlayer cannot fully integrate the features of this type. For all types, MAAFEU-net is higher than the comparison models, except for meadow and background, which are slightly less accurate than individual models, and it ranks first in overall accuracy, average F1 score and MIoU. The high accuracy achieved for forest and meadow indicates that the method fully considers the relationships between pixels of different types and can use neighboring pixels to help with feature type recognition, reducing the occurrence of misclassification.
Figure 7 shows the comparison between the prediction results of MAAFEU-net and the other six models for land use classification. MAAFEU-net effectively improves the classification of related pixels and the misclassification of similar features relative to the other methods and improves the classification accuracy of the overall types. The orange boxes in the labeled images show the prediction errors of the other models. These errors are mainly found in the classification of fragmented feature types, which ignores the links between different land use types and fails to capture the holistic nature of remotely sensed images. In Figure 7a,b, the comparison models do not identify the edges of the building type and the fragmented objects well because many fragmented farmland, water and other land use types around buildings are not within the scope of the study. Other models usually do not consider the correlation between different types and cannot relate the distribution of adjacent features for classification. Due to the similarity between the meadow- and forest-type features in Figure 7c,d and the surrounding environment, the classification is difficult and the comparison models have difficulty distinguishing the target type from similar feature objects in the background, with poor prediction results. Among them, MANet and DC-Swin find the forest and meadow categories particularly challenging to predict. In Figure 7e, MAAFEU-net has the most complete recognition of the farmland category, and the boundaries between farmland parcels are also clearly segmented. In Figure 7f, the water category in the upper left corner is missing in all the comparison models except DC-Swin and MANet. In contrast, MAAFEU-net can effectively extract the differences and connections between building, background and water, and has the best overall segmentation effect. The MAAFEU-net proposed in this paper combines the correlation between adjacent pixels and overall features through the MA module in the feature extraction process and is able to learn the associations between different land use types. At the same time, the features of segmented objects are further strengthened through the feature enhancement layer to distinguish the target features from the surrounding background, which improves the classification accuracy of types that are difficult to identify and gives greater advantages in the land use classification of multi-type remote sensing images.

4.3. Ablation Experiments

To further analyze the effects of the model components on the classification of different types of ground objects, ablation experiments are conducted on the GID dataset (Table 6). The U-net with ResNet50 as the backbone is less effective for overall type classification, especially for the forest and meadow types, whose features are difficult to identify. The model with the added Focus structure still fails to solve this problem, but the MIoU improves by 0.49%. The two attention modules, CBAM and the Swin Transformer, have their respective advantages in land use classification, and their combination is shown to be effective for classifying all land use types, demonstrating that the ability to link pixels and consider overall features helps to improve the results. Although the recognition of most types improves with the addition of the CSPlayer structure, there is a significant decrease of 12.26% in the IoU of the meadow type. MAAFEU-net effectively acquires the connections between different land use types and reduces the impact of similar features on the final segmentation results through the feature enhancement layer, solving the problem of the weakened classification ability of the CSPlayer for the meadow type. MAAFEU-net finally ranks first among all the compared networks.

5. Conclusions

This paper proposes MAAFEU-net, a novel land use classification model for remote sensing images. Based on the framework of the U-net model, it is built by adding the Focus structure to the ResNet50 backbone, incorporating the MA module that combines the Swin Transformer and CBAM, and adding a feature enhancement layer with adjustable parameters. In this paper, experimental tests are conducted on the Gwadar and GID datasets to compare the performance of the improved model with six other land use classification models. The results show that, whether using single- or multi-source regional remote sensing image data, the MA module constructed in this paper can effectively obtain the connections between different land use types and extract the corresponding category features in a more targeted way. Meanwhile, the adjustable feature enhancement layer helps to reduce the influence of similar features on the classification results, refines the features of different land use types and effectively improves the classification results of all types. By combining the advantages of both, MAAFEU-net can achieve a more accurate land use classification.
However, the experimental results also show that there are still some problems with the MAAFEU-net. In the Gwadar dataset, the U-net performs better for water. In the GID dataset, the U-net outperforms the MAAFEU-net for the meadow class while maintaining good identification of water. The ablation experiments show that integrating features from the two attention modules in the CSPlayer can effectively improve the recognition of most land use types, especially water classes, yet significantly reduce the accuracy of the meadow class. Finally, it is the feature enhancement layer that produces better results overall. Thus, combining multiple attention modules does not improve the ability to focus on important information for all categories, and overly deep mining of feature associations may destroy the original features of the meadow category. A simpler U-net model would be a better choice for features in categories such as meadow and water.
In the future, we need to refine the MAAFEU-net and, crucially, modify the approach of combining different attention modules with respect to the object of study so that the model can achieve good classification results for all categories.

Author Contributions

Conceptualization, Huajun Zhao and Yonghong Zhang; methodology, Huajun Zhao; validation, Huajun Zhao, Yonghong Zhang and Wei Tian; formal analysis, Huajun Zhao; investigation, Huajun Zhao; resources, Yonghong Zhang; data curation, Huajun Zhao; writing—original draft preparation, Huajun Zhao; writing—review and editing, Yonghong Zhang, Guangyi Ma and Kenny Thiam Choy Lim Kam Sian; visualization, Huajun Zhao; supervision, Wei Tian, Donglin Xie, Sutong Geng, Guangyi Ma and Huanyu Lu; project administration, Yonghong Zhang; funding acquisition, Yonghong Zhang. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant (No.2021-YFE0116900); the National Natural Science Foundation of China under Grant (No.42175157); Fengyun Application Pioneering Project (FY-APP) under Grant (No. FY-APP-2022.0604).

Data Availability Statement

The Gwadar dataset is available at the following website: https://gas.graviti.com/dataset/xvxibu/GwadarDataset (accessed on 7 December 2022).

Acknowledgments

The authors thank the National Meteorological Administration of China for providing the meteorological data for this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, X.; He, J.; Yao, Y.; Zhang, J.; Liang, H.; Wang, H.; Hong, Y. Classifying urban land use by integrating remote sensing and social media data. Int. J. Geogr. Inf. Sci. 2017, 31, 1675–1696. [Google Scholar] [CrossRef]
  2. Hashem, N.; Balakrishnan, P. Change Analysis of Land Use/Land Cover and Modelling Urban Growth in Greater Doha, Qatar. Ann. GIS 2015, 21, 233–247. [Google Scholar] [CrossRef]
  3. Rahman, A.; Kumar, S.; Fazal, S.; Siddiqui, M.A. Assessment of Land Use/Land Cover Change in the North-West District of Delhi Using Remote Sensing and GIS Techniques. J. Indian Soc. Remote Sens. 2012, 40, 689–697. [Google Scholar] [CrossRef]
  4. Nguyen, K.-A.; Liou, Y.-A. Mapping Global Eco-Environment Vulnerability Due to Human and Nature Disturbances. MethodsX 2019, 6, 862–875. [Google Scholar] [CrossRef]
  5. Talukdar, S.; Pal, S. Wetland Habitat Vulnerability of Lower Punarbhaba River Basin of the Uplifted Barind Region of Indo-Bangladesh. Geocarto Int. 2020, 35, 857–886. [Google Scholar] [CrossRef]
  6. Nguyen, A.K.; Liou, Y.-A.; Li, M.-H.; Tran, T.A. Zoning Eco-Environmental Vulnerability for Environmental Management and Protection. Ecol. Indic. 2016, 69, 100–117. [Google Scholar] [CrossRef]
  7. Dao, P.; Liou, Y.-A. Object-Based Flood Mapping and Affected Rice Field Estimation with Landsat 8 OLI and MODIS Data. Remote. Sens. 2015, 7, 5077–5097. [Google Scholar] [CrossRef]
  8. Liou, Y.-A.; Kar, S.; Chang, L. Use of High-Resolution FORMOSAT-2 Satellite Images for Post-Earthquake Disaster Assessment: A Study Following the 12 May 2008 Wenchuan Earthquake. Int. J. Remote. Sens. 2010, 31, 3355–3368. [Google Scholar] [CrossRef]
  9. Liou, Y.-A.; Sha, H.-C.; Chen, T.-M.; Wang, T.-S.; Li, Y.-T.; Lai, Y.-C.; Chiang, M.-H. Assessment of Disaster Losses in Rice Paddy Field and Yield after Tsunami Induced by the 2011 Great East Japan Earthquake. J. Mar. Sci. Technol. 2012, 20, 2. [Google Scholar]
  10. Zhang, Y.; Ge, T.; Tian, W.; Liou, Y.-A. Debris Flow Susceptibility Mapping Using Machine-Learning Techniques in Shigatse Area, China. Remote. Sens. 2019, 11, 2801. [Google Scholar] [CrossRef]
  11. Talukdar, S.; Pal, S. Effects of Damming on the Hydrological Regime of Punarbhaba River Basin Wetlands. Ecol. Eng. 2019, 135, 61–74. [Google Scholar] [CrossRef]
  12. Langat, P.K.; Kumar, L.; Koech, R.; Ghosh, M.K. Monitoring of Land Use/Land-Cover Dynamics Using Remote Sensing: A Case of Tana River Basin, Kenya. Geocarto Int. 2021, 36, 1470–1488. [Google Scholar] [CrossRef]
  13. Zomer, R.J.; Trabucco, A.; Ustin, S.L. Building Spectral Libraries for Wetlands Land Cover Classification and Hyperspectral Remote Sensing. J. Environ. Manag. 2009, 90, 2170–2177. [Google Scholar] [CrossRef] [PubMed]
  14. Talukdar, S.; Singha, P.; Mahato, S.; Shahfahad; Pal, S.; Liou, Y.-A.; Rahman, A. Land-Use Land-Cover Classification by Machine Learning Classifiers for Satellite Observations—A Review. Remote Sens. 2020, 12, 1135. [Google Scholar] [CrossRef]
  15. Abdi, A.M. Land Cover and Land Use Classification Performance of Machine Learning Algorithms in a Boreal Landscape Using Sentinel-2 Data. GIScience Remote Sens. 2020, 57, 1–20. [Google Scholar] [CrossRef]
  16. Wang, J.; Bretz, M.; Dewan, M.A.A.; Delavar, M.A. Machine Learning in Modelling Land-Use and Land Cover-Change (LULCC): Current Status, Challenges and Prospects. Sci. Total Environ. 2022, 822, 153559. [Google Scholar] [CrossRef]
  17. Zhang, M.; Zhang, C.; Kafy, A.-A.; Tan, S. Simulating the Relationship between Land Use/Cover Change and Urban Thermal Environment Using Machine Learning Algorithms in Wuhan City, China. Land 2021, 11, 14. [Google Scholar] [CrossRef]
  18. Hao, J.; Lin, Q.; Wu, T.; Chen, J.; Li, W.; Wu, X.; Hu, G.; La, Y. Spatial–Temporal and Driving Factors of Land Use/Cover Change in Mongolia from 1990 to 2021. Remote Sens. 2023, 15, 1813. [Google Scholar] [CrossRef]
  19. Carlier, J.; Doyle, M.; Finn, J.A.; Ó hUallacháin, D.; Moran, J. A Landscape Classification Map of Ireland and Its Potential Use in National Land Use Monitoring. J. Environ. Manag. 2021, 289, 112498. [Google Scholar] [CrossRef]
  20. Kaczmarek, I.; Iwaniak, A.; Świetlicka, A.; Piwowarczyk, M.; Nadolny, A. A Machine Learning Approach for Integration of Spatial Development Plans Based on Natural Language Processing. Sustain. Cities Soc. 2022, 76, 103479. [Google Scholar] [CrossRef]
  21. Zhang, M.; Kafy, A.-A.; Xiao, P.; Han, S.; Zou, S.; Saha, M.; Zhang, C.; Tan, S. Impact of Urban Expansion on Land Surface Temperature and Carbon Emissions Using Machine Learning Algorithms in Wuhan, China. Urban Clim. 2023, 47, 101347. [Google Scholar] [CrossRef]
  22. Rahnama, M.R. Forecasting land-use changes in Mashhad Metropolitan area using Cellular Automata and Markov chain model for 2016-2030. Sustain. Cities Soc. 2020, 64, 102548. [Google Scholar] [CrossRef]
  23. Sobhani, P.; Esmaeilzadeh, H.; Mostafavi, H. Simulation and impact assessment of future land use and land cover changes in two protected areas in Tehran, Iran. Sustain. Cities Soc. 2021, 75, 103296. [Google Scholar] [CrossRef]
  24. Shapiro, L.G.; Stockman, G.C. Computer Vision; Prentice Hall: Upper Saddle River, NJ, USA, 2001; Volume 3. [Google Scholar]
  25. Liang, X.; Luo, C.; Quan, J.; Xiao, K.; Gao, W. Research on Progress of Image Semantic Segmentation Based on Deep Learning. Comput. Eng. Appl. 2020, 56, 18–28. [Google Scholar]
  26. Yu, M.T.; Sein, M.M. Automatic Image Captioning System Using Integration of N-Cut and Color-Based Segmentation Method. In Proceedings of the SICE Annual Conference 2011, Tokyo, Japan, 13–18 September 2011; pp. 28–31. [Google Scholar]
  27. Wang, D.; He, K.; Wang, B.; Liu, X.; Zhou, J. Solitary Pulmonary Nodule Segmentation Based on Pyramid and Improved Grab Cut. Comput. Methods Programs Biomed. 2021, 199, 105910. [Google Scholar] [CrossRef] [PubMed]
  28. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  29. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. ISBN 978-3-319-24573-7. [Google Scholar]
  31. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 833–851. ISBN 978-3-030-01233-5. [Google Scholar]
  32. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  33. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  34. Yuan, X.; Chen, Z.; Chen, N.; Gong, J. Land cover classification based on the PSPNet and superpixel segmentation methods with high spatial resolution multispectral remote sensing imagery. J. Appl. Remote. Sens. 2021, 15, 034511. [Google Scholar] [CrossRef]
  35. Zhan, Z.Q.; Zhang, X.M.; Liu, Y.; Sun, X.; Pang, C.; Zhao, C.B. Vegetation Land Use/Land Cover Extraction From High-Resolution Satellite Images Based on Adaptive Context Inference. IEEE Access 2020, 8, 21036–21051. [Google Scholar] [CrossRef]
  36. Hu, Y.; He, H.; Weng, L. Hyperspectral and LiDAR Data Land-Use Classification Using Parallel Transformers. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 703–706. [Google Scholar]
  37. Li, Z.; Chen, G.; Zhang, T. A CNN-Transformer Hybrid Approach for Crop Classification Using Multitemporal Multisensor Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 847–858. [Google Scholar] [CrossRef]
  38. Sun, Y.; Zhang, X.; Xin, Q.; Huang, J. Developing a Multi-Filter Convolutional Neural Network for Semantic Segmentation Using High-Resolution Aerial Imagery and LiDAR Data. ISPRS J. Photogramm. Remote Sens. 2018, 143, 3–14. [Google Scholar] [CrossRef]
  39. Wang, X.; Zhang, X.; Su, C. Land Use Classification of Remote Sensing Images Based on Multi-Scale Learning and Deep Convolution Neural Network. J. ZheJiang Univ. Sci. Ed. 2020, 47, 715–723. [Google Scholar]
  40. Shi, J.; Zhang, X.; Liu, X.; Lei, Y. Deep Change Feature Analysis Network for Observing Changes of Land Use or Natural Environment. Sustain. Cities Soc. 2021, 68, 102760. [Google Scholar] [CrossRef]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  43. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. ISBN 978-3-030-01233-5. [Google Scholar]
  44. Ultralytics. Yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 23 May 2022).
  45. Wang, C.-Y.; Mark Liao, H.-Y.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  46. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]
Figure 1. Overall framework of the MAAFEU-net.
Figure 2. (a) Overall architecture of the Swin Transformer blocks; (b) W-MSA, where the feature map is windowed to calculate self-attention; and (c) SW-MSA, where the sliding window is followed by a mask with a moving local window to achieve the equivalent of four windows to calculate self-attention.
Figure 3. (a) Overall architecture of CBAM, (b) CAM calculation process, and (c) SAM calculation process.
Figure 4. Feature enhancement layer.
Figure 5. Comparison of images from the Gwadar and GID datasets. (a–c) Gwadar dataset and (d–h) GID dataset.
Figure 6. (a–e) show the comparison of experimental results of different models on the Gwadar dataset with different land use types.
Figure 7. (a–f) show the comparison of experimental results of different models on the GID dataset with different land use types.
Table 1. Samples of the Gwadar dataset. (The Original Image and Labeled Image columns of the published table contain sample image patches and are omitted here.)

Land-Use Class | Introduction
Building (labeled in red) | Refers to urban residential land and the land for facilities and factories
Plant (labeled in green) | Refers to land for trees, meadows and other crops
Water (labeled in blue) | Refers to land for the sea and pond
Port (labeled in purple) | Refers to land for berthing vessels and cargo handling
Road (labeled in orange) | Refers to land used for transportation
Background (labeled in black) | Refers to land that is difficult to identify or not in the study type
Table 2. Statistics of areas and proportions of different land-use types.

Land-Use Class | Area (m²) | Proportion (%)
Building | 300,215,284 | 25.34
Plant | 26,752,008 | 2.26
Water | 285,282,488 | 24.08
Port | 3,275,696 | 0.28
Road | 30,865,572 | 2.60
Background | 538,499,832 | 45.44
Total | 1,184,890,880 | 100
Table 3. Samples of the GID dataset. (The Original Image and Labeled Image columns of the published table contain sample image patches and are omitted here.)

Land-Use Class | Label Color
Building | red
Farmland | green
Water | blue
Forest | light blue
Meadow | yellow
Background | black
Table 4. Experimental results of different land use types and overall evaluation using the Gwadar dataset. Columns Building–Background give the IoU (%) of each land use type.

Method | Building | Plant | Water | Port | Road | Background | Overall Accuracy | Average F1 Score | MIoU
DC-Swin | 86.16 | 52.23 | 98.52 | 72.38 | 45.52 | 88.07 | 93.48 | 83.87 | 73.83
SegNet | 88.27 | 58.81 | 98.71 | 76.48 | 59.23 | 89.80 | 94.63 | 87.32 | 78.55
MANet | 88.07 | 57.65 | 98.73 | 79.42 | 57.60 | 90.04 | 94.66 | 87.53 | 78.59
PSPNet | 90.23 | 60.13 | 98.67 | 77.88 | 59.66 | 91.34 | 95.32 | 88.04 | 79.65
Deeplabv3+ | 90.49 | 61.42 | 98.38 | 77.07 | 61.42 | 91.07 | 95.23 | 88.26 | 80.11
U-net | 92.01 | 69.84 | 99.01 | 86.91 | 68.61 | 92.88 | 96.26 | 91.39 | 84.88
MAAFEU-net | 92.71 | 76.37 | 98.94 | 92.00 | 77.32 | 93.65 | 96.78 | 93.69 | 88.50
Table 5. Experimental results of different land-cover types and overall evaluation using the GID dataset. Columns Building–Background give the IoU (%) of each land use type.

Method | Building | Farmland | Water | Forest | Meadow | Background | Overall Accuracy | Average F1 Score | MIoU
DC-Swin | 58.24 | 43.79 | 66.88 | 28.70 | 8.85 | 55.83 | 68.46 | 62.63 | 43.71
SegNet | 32.12 | 65.55 | 70.49 | 8.20 | 35.44 | 59.43 | 73.08 | 64.57 | 45.21
MANet | 60.24 | 56.18 | 71.63 | 43.96 | 38.70 | 57.41 | 72.86 | 70.38 | 54.69
PSPNet | 61.86 | 62.13 | 70.49 | 39.29 | 45.19 | 60.47 | 75.69 | 67.84 | 56.99
U-net | 62.40 | 55.42 | 74.19 | 48.55 | 51.61 | 58.79 | 73.91 | 73.80 | 58.49
Deeplabv3+ | 62.58 | 64.39 | 71.99 | 44.75 | 48.50 | 59.71 | 76.10 | 74.22 | 58.65
MAAFEU-net | 66.01 | 66.15 | 74.44 | 54.76 | 49.53 | 59.93 | 77.35 | 76.38 | 61.80
Table 6. Ablation experiment results on the GID dataset. Columns Building–Background give the IoU (%) of each land use type.

Method | Building | Farmland | Water | Forest | Meadow | Background | MIoU
U-net (ResNet50) | 55.68 | 61.2 | 60.54 | 33.17 | 29.98 | 58.83 | 49.90
U-net + Focus | 54.52 | 56.49 | 67.75 | 27.29 | 36.94 | 59.35 | 50.39
U-net + Focus + CBAM | 61.13 | 60.56 | 67.62 | 37.33 | 37.71 | 55.83 | 53.36
U-net + Focus + Swin Transformer | 59.16 | 57.61 | 64.75 | 38.13 | 34.35 | 57.14 | 51.85
U-net + Focus + Swin Transformer + CBAM | 55.86 | 60.79 | 66.15 | 45.08 | 41.73 | 60.49 | 55.01
U-net + Focus + MA | 60.36 | 65.37 | 74.14 | 47.24 | 29.47 | 60.74 | 56.22
MAAFEU-net | 66.01 | 66.15 | 74.44 | 54.76 | 49.53 | 59.93 | 61.80
