1. Introduction
In recent years, with the rapid development of remote sensing technology, computer technology, and emerging technologies such as big data and cloud computing, humans have acquired vast, multisource, heterogeneous remote sensing data resources. These data are characterized by their large volume, diversity, high accuracy, and rapid change [1], providing new opportunities for the precise extraction of land use and land cover (LULC) remote sensing information. However, they also place higher demands on extraction methods: traditional LULC information extraction methods can no longer meet current needs. At the same time, the rise of artificial intelligence offers crucial support for the efficient analysis and in-depth mining of remote sensing big data [2,3], and intelligent extraction of LULC information has become a research hotspot in remote sensing [4]. Therefore, effectively integrating these emerging data and technologies, fully exploiting the spatial, temporal, and attribute dimensions of remote sensing data [5], and developing new LULC remote sensing information extraction methods have become urgent issues to address.
With the gradual application of deep learning in remote sensing, significant progress has been made in intelligently extracting LULC information from remote sensing images. However, existing methods still have notable shortcomings. First, most studies focus on extracting a single land cover type, such as roads [6,7,8], croplands [9,10], buildings [11,12,13], water bodies [14,15], woodlands [16,17], grasslands [18,19], and impervious surfaces [20,21]; few studies address multielement LULC information, and there is an urgent need to extract more types of LULC information [22]. Moreover, most deep learning algorithms perform well on specific classic datasets but require improvements in generalization and applicability to other remote sensing images and study areas [23,24,25,26,27]. Second, owing to the diversity of LULC types and their significant spatial heterogeneity and similarity, misclassification and confusion readily occur during extraction, often resulting in low accuracy [28]. This is particularly true for small target objects, whose information is widely distributed and complex, making them susceptible to background noise interference. Traditional CNN models have limited recognition capabilities and struggle to parse complex spatial relationships accurately, leading to inaccurate or missing feature extraction for small target objects. Researchers have designed more complex and efficient network structures [29,30] to extract small target features from remote sensing images accurately. However, small target objects in remote sensing images often differ greatly in scale from their surrounding environment, and these improvements mainly focus on local features, which limits the capture of extensive contextual information, especially cross-scale semantic information. Additionally, there is considerable variability in how small target features are fused; improper multiscale feature fusion can cause mismatches in object information, weakening the fusion between different small target features and limiting effective extraction. Therefore, further exploration and innovation of more efficient and precise methods for extracting small target features in remote sensing images are needed. Meanwhile, the numbers of pixels belonging to different land cover categories in remote sensing images differ substantially, producing a prominent class imbalance. Some scholars have attempted to alleviate this problem with improved loss functions [31,32]. However, under extreme class imbalance, these single loss functions often focus on the more abundant classes while neglecting the less abundant ones. This makes it difficult to extract effective features from minority class samples and quickly causes overfitting to the more numerous land cover categories, resulting in poor extraction performance for minority class information.
Therefore, given the urgent need to extract more types of LULC information from remote sensing images and to enhance model accuracy and generalization when handling small target objects and class imbalance, this paper proposes an innovative spatial context information and multiscale feature fusion encoder–decoder network, SCIMF-Net. SCIMF-Net employs an improved ResNeXt as the backbone network to achieve multiscale feature extraction and fusion, significantly enhancing the expressiveness and richness of features and thereby improving the model’s understanding of complex scenes. Additionally, the network incorporates a channel attention mechanism to better capture the relationship between small target objects and their surrounding environment, thus increasing segmentation accuracy. To address class imbalance, SCIMF-Net uses a weighted joint loss function that dynamically adjusts the weights of different classes, enabling the model to focus more on samples from less abundant land cover categories. Experimental results demonstrate that SCIMF-Net not only effectively extracts more types of LULC information but also successfully addresses the challenges of class imbalance and small target information extraction, showcasing its significant advantages and application potential for LULC information extraction from remote sensing images.
In summary, SCIMF-Net (spatial context information and multiscale feature fusion encoding–decoding network) is designed to address the challenges that existing mainstream CNN models face in extracting LULC information from remote sensing images: difficulty understanding the complex spatial relationships and semantic information between land cover types, insufficient capture of contextual information, class imbalance that hampers the extraction of less common categories, and imprecise extraction of small-object land cover information.
Overall, the main contributions are as follows:
The SCIMF-Net employs an enhanced ResNeXt-101 deep backbone network, which markedly improves the extraction capability of features related to small target objects.
A new parallel multiscale feature extraction and fusion (PMFF) module is developed to facilitate the integration of features at various scales, thereby enriching the model’s comprehension of spatial context at both global and local levels.
We introduce a weighted joint loss function that further enhances the SCIMF-Net’s performance in LULC information extraction, especially when addressing class imbalances among different object categories.
3. Method
3.1. Overview
The SCIMF-Net spatial context information and multiscale feature fusion encoder–decoder network, as shown in Figure 4, consists of three main components: the SCIMF-Net encoder, the PMFF module, and the SCIMF-Net decoder. The SCIMF-Net encoder is responsible for extracting features from remote sensing images, progressively reducing the spatial dimensions to capture high-level abstract features. The PMFF module captures mid-to-high-level multiscale features, extracting rich contextual information. The SCIMF-Net decoder gradually restores the spatial resolution of the remote sensing images through upsampling, convolution, and activation functions; it uses skip connections to combine low-level features from the encoder with mid-to-high-level features from the PMFF module, continuously acquiring multiscale information to generate the segmented remote sensing image.
3.2. Encoder of SCIMF-Net
ResNeXt-101 [33] is an advanced convolutional neural network that incorporates the design principles of the Inception architecture. By utilizing grouped convolutions and multiple parallel branches, it effectively expands the network’s capacity and achieves high-dimensional feature representation. ResNeXt-101 demonstrates strong adaptability for extracting complex semantic information when applied to remote sensing image processing. Its multibranch structure allows multiscale feature fusion at different levels, significantly enhancing boundary and local information accuracy in LULC extraction. However, ResNeXt-101 has limitations in identifying small target information in complex scenes due to background noise and redundant irrelevant features, and it also struggles with the small-scale variations found in remote sensing images. To address these limitations, this paper improves the basic structure of the original ResNeXt-101 network, specifically the ResNeXt residual block, by integrating the SE attention mechanism [34]. This mechanism recalibrates feature channels, enabling the model to focus more on the essential features needed for land cover information extraction while suppressing unnecessary ones, thereby enhancing the backbone network’s expressive and generalization capabilities. The proposed SE-ResNeXt residual block, as shown in Figure 5, helps the backbone network more accurately identify and segment different semantic regions in images, improving segmentation precision and efficiency.
To adjust the feature map dimensions, a 1 × 1 convolution is first used for dimensionality reduction or expansion, specifically expanding the number of channels by a factor of four. Subsequently, a grouped convolution with 64 groups is applied to feature maps at 1/16 of the original image size, capturing rich features from multiple perspectives that are then integrated through concatenation. Next, another 1 × 1 convolution restores the feature map dimensions to match the original input. On this basis, the SE attention mechanism, calculated with Equation (4), is introduced to strengthen the focus on critical features and suppress irrelevant information, thereby improving feature map precision. Finally, the features recalibrated by the SE attention mechanism, applied to the residual branch output of Equation (5), are added to the block input, as shown in Equation (6), forming the final output feature map. This approach optimizes the feature extraction process and enhances the accuracy and efficiency of remote sensing image analysis.
$$ s_c = \sigma\left(W_2\,\delta\left(W_1 z\right)\right)_c, \qquad z_d = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_d(i, j) \qquad (4) $$

In this context, $s_c$ denotes the attention applied to channel c of the input feature map $X$, specifically the c-th channel. The dimensions H and W represent the height and width of the feature map, respectively. The symbol $\delta$ signifies the ReLU activation function, while $\sigma$ indicates the sigmoid activation function that normalizes the weights to the interval [0, 1]. Additionally, $W_1$ and $W_2$ are the weights of the fully connected layers, and $x_d(i, j)$ refers to the value of the element at position (i, j) in the d-th channel of the input feature map.
Let the input feature map be denoted as X. The residual function can be expressed as

$$ F(X; W) = \delta\left(\mathrm{BN}\left(\sum_{i=1}^{G} W_i * X_i\right)\right) \qquad (5) $$

Here, $W$ represents the parameter set, $X_i$ corresponds to the i-th group of the input feature map X, $W_i$ denotes the convolutional kernel parameters for the i-th group, $\mathrm{BN}$ indicates the batch normalization operation, $\delta$ refers to the ReLU activation function, and $G$ signifies the number of groups in the grouped convolution.
Let the output feature map be denoted as Y. The formula for Y is as follows:

$$ Y = X + S\left(F(X; W)\right) \qquad (6) $$

In this context, S denotes the function of the SE attention mechanism.
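To make the block structure concrete, the following is a minimal PyTorch sketch of an SE-ResNeXt residual block as described above (1 × 1 reduction, 64-group 3 × 3 grouped convolution, 1 × 1 restoration, SE recalibration, residual addition). The class names, channel widths, and reduction ratio are illustrative assumptions, not the paper’s exact configuration from Table 2.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (cf. Equation (4))."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: per-channel mean z_d
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),                       # delta: ReLU
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma: weights in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s                                     # channel recalibration

class SEResNeXtBlock(nn.Module):
    """SE-ResNeXt residual block: 1x1 reduce -> 3x3 grouped conv (64 groups)
    -> 1x1 restore -> SE recalibration -> residual addition, Y = X + S(F(X))."""
    def __init__(self, channels: int, width: int = 256, groups: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.se(self.body(x)))      # cf. Equation (6)

if __name__ == "__main__":
    block = SEResNeXtBlock(channels=256, width=256, groups=64)
    print(block(torch.randn(1, 256, 64, 64)).shape)      # torch.Size([1, 256, 64, 64])
```

Note that the grouped-convolution width must be divisible by the number of groups (here 256/64); the SE branch recalibrates the residual output before the identity addition, matching Y = X + S(F(X)).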
The original ResNeXt-101 network is primarily used for classification tasks in computer vision, but it has limitations when applied to LULC information extraction. This paper redesigns and improves the original ResNeXt-101 structure and parameters to expand its applicability, resulting in the SE-ResNeXt-101 network. This improved network serves as the backbone encoder of SCIMF-Net, used for multiscale feature extraction from remote sensing images. The SE-ResNeXt-101 architecture consists of multiple SE-ResNeXtBlock modules, basic convolution operations, and pooling layers stacked together; the specific improved parameters are detailed in Table 2. These enhancements aim to improve network performance and applicability for LULC information extraction tasks.
3.3. Parallel Multiscale Feature Extraction Fusion Module (PMFF)
In response to the insufficient spatial context acquisition and weak multiscale feature fusion in the encoding stage of traditional convolutional neural networks, this paper designs a parallel multiscale feature extraction and fusion module (PMFF). As shown in Figure 6, this module processes in parallel the low-level features $F_1$, intermediate features $F_2$, and high-level features $F_3$ extracted by the SE-ResNeXt-101 backbone network, applies 3 × 3 convolution blocks for feature extraction, and then upsamples the results to a unified feature map size before concatenating and fusing them. Furthermore, 3 × 3 dilated depthwise separable convolutions are used for multiscale feature extraction, and the results are concatenated and fused again to obtain the new output low-level features $F_1'$, intermediate features $F_2'$, and high-level features $F_3'$. This design gives the new features more refined information than the originals, effectively improving feature expressiveness and network performance.
The overall PMFF multiscale feature extraction and fusion process is as follows. The original remote sensing images are processed through the improved SE-ResNeXt-101 backbone network to extract the features $F_1$, $F_2$, and $F_3$. Let K represent the number of standard convolution branches, M the number of modules, and L the number of depthwise separable convolution branches. Using Equation (7), each branch employs a standard convolution function $f_k(\cdot)$ for feature extraction:

$$ \tilde{F}_k = f_k\left(F_k\right), \qquad k = 1, \dots, K \qquad (7) $$

The scale of the feature maps is adjusted by the upsampling operation S, as described in Equation (8), resulting in the resized feature maps $\hat{F}_k$:

$$ \hat{F}_k = S\left(\tilde{F}_k\right) \qquad (8) $$

In the multibranch parallel structure, the feature maps are concatenated by the function C defined in Equation (9), yielding the concatenated feature map $F_{cat}$:

$$ F_{cat} = C\left(\hat{F}_1, \dots, \hat{F}_K\right) \qquad (9) $$

Subsequently, Equation (10) applies the dilated depthwise separable convolution functions $g_l(\cdot)$ for multiscale feature extraction on the concatenated feature map:

$$ G_l = g_l\left(F_{cat}\right), \qquad l = 1, \dots, L \qquad (10) $$

Finally, Equation (11) uses the concatenation function to produce the output feature map, from which the new features $F_1'$, $F_2'$, and $F_3'$ are obtained:

$$ F_{out} = C\left(G_1, \dots, G_L\right) \qquad (11) $$
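As an illustration of this pipeline, the following PyTorch sketch implements the PMFF flow of Equations (7)–(11): per-scale 3 × 3 convolutions, upsampling to a common resolution, concatenation, and parallel dilated depthwise separable convolutions. The channel counts, dilation rates, and the single fused output tensor are assumptions for illustration; the paper’s exact branch configuration may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 convolution block (one Equation (7) branch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class DilatedDWConv(nn.Module):
    """3x3 dilated depthwise separable convolution (cf. Equation (10))."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return F.relu(self.bn(self.pw(self.dw(x))))

class PMFF(nn.Module):
    """Parallel multiscale feature extraction and fusion (Equations (7)-(11))."""
    def __init__(self, chans=(256, 512, 1024), mid=256, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(conv3x3(c, mid) for c in chans)            # Eq. (7)
        cat_ch = mid * len(chans)
        self.multi = nn.ModuleList(DilatedDWConv(cat_ch, d) for d in dilations)  # Eq. (10)
        self.project = conv3x3(cat_ch * len(dilations), cat_ch)

    def forward(self, feats):
        # feats: [low, intermediate, high] backbone features, largest map first
        target = feats[0].shape[-2:]
        resized = [F.interpolate(b(f), size=target, mode="bilinear",
                                 align_corners=False)
                   for b, f in zip(self.branches, feats)]                        # Eq. (8)
        fused = torch.cat(resized, dim=1)                                        # Eq. (9)
        out = torch.cat([m(fused) for m in self.multi], dim=1)                   # Eq. (11)
        return self.project(out)

if __name__ == "__main__":
    f1 = torch.randn(1, 256, 64, 64)     # low-level features
    f2 = torch.randn(1, 512, 32, 32)     # intermediate features
    f3 = torch.randn(1, 1024, 16, 16)    # high-level features
    print(PMFF()([f1, f2, f3]).shape)    # torch.Size([1, 768, 64, 64])
```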
3.4. Weighted Joint Loss Function
The formula for the cross-entropy loss function is as follows:

$$ L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{J} y_{ij}\,\ln\left(p_{ij}\right) \qquad (12) $$

In this context, N represents the total number of pixels in the remote sensing image, while J denotes the total number of land cover types. The label $y_{ij}$ is equal to 1 if pixel i actually belongs to category j, and 0 otherwise. $p_{ij}$ indicates the probability predicted by the model that pixel i belongs to category j, and $\ln$ refers to the natural logarithm.
The formula for the focal loss function is as follows:

$$ L_{FL} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_i\left(1 - p_i\right)^{\gamma}\ln\left(p_i\right) \qquad (13) $$

In this context, $\alpha_i$ represents the weight for category i, $\gamma$ is the modulation parameter, and $p_i$ denotes the probability predicted by the model for pixel i.
The formula for the Dice loss function is as follows:

$$ L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_i\,p_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i} \qquad (14) $$

where N represents the total number of pixels in the remote sensing image, $y_i$ denotes the ground-truth label of pixel i, and $p_i$ denotes the probability predicted by the model for pixel i.
This study employs a weighted joint loss function, as expressed in Equation (15):

$$ L_{joint} = \lambda_1 L_{CE} + \lambda_2 L_{FL} + \lambda_3 L_{Dice} \qquad (15) $$

In this context, $L_{joint}$, $L_{CE}$, $L_{FL}$, and $L_{Dice}$ denote the joint loss, cross-entropy loss, focal loss, and Dice loss, respectively. The variables $\lambda_1$, $\lambda_2$, and $\lambda_3$ are adaptive weighting parameters.
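A minimal PyTorch sketch of this weighted joint loss is given below. It assumes fixed scalar weights $\lambda_1$–$\lambda_3$ and a class-agnostic focal term; the paper’s adaptive weighting scheme and per-class $\alpha_i$ are not specified here, so these are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedJointLoss(nn.Module):
    """Weighted joint loss (cf. Equation (15)): l1*CE + l2*Focal + l3*Dice.
    Fixed scalar weights and a class-agnostic focal term are simplifying
    assumptions; the paper describes adaptive weighting parameters."""
    def __init__(self, l1=1.0, l2=1.0, l3=1.0, gamma=2.0, eps=1e-6):
        super().__init__()
        self.l1, self.l2, self.l3 = l1, l2, l3
        self.gamma, self.eps = gamma, eps

    def forward(self, logits, target):
        # logits: (B, J, H, W) raw scores; target: (B, H, W) class indices
        ce_map = F.cross_entropy(logits, target, reduction="none")
        ce = ce_map.mean()                                   # cf. Equation (12)
        pt = torch.exp(-ce_map)                              # prob of the true class
        focal = ((1.0 - pt) ** self.gamma * ce_map).mean()   # cf. Equation (13)
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * onehot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
        dice = 1.0 - (2.0 * inter / (union + self.eps)).mean()  # cf. Equation (14)
        return self.l1 * ce + self.l2 * focal + self.l3 * dice  # Equation (15)

if __name__ == "__main__":
    loss_fn = WeightedJointLoss()
    logits = torch.randn(2, 17, 64, 64)          # 17 secondary LULC categories
    target = torch.randint(0, 17, (2, 64, 64))
    print(loss_fn(logits, target))
```

Computing the Dice term per class and then averaging, as above, is one common way to keep minority categories from being swamped by the dominant ones.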
3.5. SCIMF-Net Decoder
The structure of the SCIMF-Net decoder, as shown in Figure 7, integrates the features extracted by the pretrained network in the encoder, including ConV_Head and the features from Stage1, Stage2, Stage3, and Stage4 of the SE-ResNeXt-101 backbone. Skip connections incorporate these multistage encoder features into the corresponding levels of the decoder. Starting from Stage4, two consecutive 3 × 3 convolutional blocks are used for feature refinement, followed by upsampling to enlarge the feature map, which is then fused with the features from the corresponding encoder stage. This process is repeated until the final layer is fused, producing the output feature map. This design effectively leverages the rich features of the encoder, providing strong support for the decoder.
The SCIMF-Net decoder significantly enhances multiscale feature fusion through skip connections, effectively leveraging local and global contextual information. Additionally, incorporating batch normalization in the core 3 × 3 convolutional blocks accelerates and stabilizes the decoder’s training, making the model more robust. To further improve generalization, dropout is employed, enhancing the model’s robustness and adaptability. These improvements endow the SCIMF-Net decoder with excellent performance and generalization capabilities.
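The following PyTorch sketch shows one such decoder stage: refinement by two 3 × 3 convolution blocks with batch normalization and dropout, 2× upsampling, and skip-connection fusion with the matching encoder feature. Fusion by concatenation, the channel sizes, and the dropout rate are illustrative assumptions rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One SCIMF-Net-style decoder stage: two 3x3 conv blocks with batch
    normalization, dropout for generalization, 2x upsampling, then fusion
    with the skip feature from the matching encoder stage (a sketch)."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int, p_drop: float = 0.1):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, 1)

    def forward(self, x, skip):
        x = self.up(self.refine(x))        # refine, then enlarge the feature map
        x = torch.cat([x, skip], dim=1)    # skip connection from the encoder
        return self.fuse(x)

if __name__ == "__main__":
    stage4 = torch.randn(1, 1024, 16, 16)  # deepest encoder feature
    skip3 = torch.randn(1, 512, 32, 32)    # matching encoder stage feature
    print(DecoderBlock(1024, 512, 512)(stage4, skip3).shape)  # (1, 512, 32, 32)
```

Stacking such blocks from Stage4 back to the first fused layer reproduces the repeated refine-upsample-fuse pattern described above.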
4. Experiments
4.1. Ablation Experiments
As shown in Figure 8, the performance experiments on the SCIMF-Net model demonstrate that incorporating the proposed modules step by step significantly enhances training effectiveness. Using the improved SE-ResNeXt-101 alone as the backbone network resulted in relatively low training accuracy. Adding the PMFF module improved training accuracy by approximately 3.58% and reduced training loss by 0.076, significantly enhancing the model’s feature extraction capability. Finally, with the inclusion of the weighted Combined Loss function, training accuracy increased by an additional 6.85% and training loss decreased by a further 0.119. These results validate the effectiveness of each module and its positive contribution to the performance of the SCIMF-Net model.
This study uses a prediction dataset derived from the original Landsat remote sensing images to conduct ablation experiments, aiming to verify the effectiveness of the module combinations within the SCIMF-Net model. The ablation experiments for the three components of SCIMF-Net are presented in Table 3, where “×” indicates that the module is not used and “√” indicates that it is. Combination A represents the SE-ResNeXt-101 backbone network alone, B the SE-ResNeXt-101 with the PMFF module, and C the SE-ResNeXt-101 with the PMFF module and the weighted combined loss function (Combined Loss).
This study optimizes land cover information extraction through three combinations. Combination A uses the improved SE-ResNeXt-101 backbone network, whose attention mechanism enhances the network’s contextual awareness and improves the representation of effective features. Combination B captures richer feature information across scales to deepen the understanding of various land cover types and effectively integrates multiscale information to improve spatial comprehension. Experimental results show that after adding the PMFF module, the PA, MPA, and MIOU metrics increased by 0.65%, 0.1%, and 3.4%, respectively, indicating a steady improvement in LULC information extraction accuracy across land cover categories. Combination C incorporates the weighted Combined Loss function, which combines different types of losses to optimize LULC extraction accuracy across land cover types more comprehensively. This effectively mitigates class imbalance and further enhances the stability and accuracy of LULC information extraction, with the PA, MPA, and MIOU metrics improving by 1.07%, 2.07%, and 1.95%, respectively.
To provide a more precise comparison of the specific effects of each module on LULC information extraction, a visual analysis was conducted comparing the secondary class information extracted by all module combinations, as illustrated in Figure 9 and Figure 10.
All combination methods achieved satisfactory results in extracting rice paddy (11) information. However, in extracting dry land (12) information, Combination A lost some dry land data. For forested land (21), Combinations A and B suffered interference and misclassification. In extracting shrubland (22), Combination A failed to extract the shrubland data. For sparse forest (23), Combination B did not effectively capture edge information. In extracting other forested land (24), Combination B exhibited significant extraction errors, whereas Combination C produced a more complete edge representation of the smaller other forested areas. All combination methods performed well on high-cover grassland (31) and medium-cover grassland (32), but on low-cover grassland (33), Combination A suffered severe information loss. In extracting highly imbalanced land cover types such as channels (41), Combinations A and B failed to extract the continuous linear features of the channels, whereas Combination C successfully captured the complete channel information. In the extraction of lakes (42) and reservoirs (43), Combination A also showed detail loss and misclassification. All combination methods performed poorly on wetlands (44). In the extraction of urban land (51), rural homesteads (52), and construction land (53), Combination A exhibited incomplete extraction and significant detail loss. For bare rock (61), Combination B performed poorly. These results indicate that Combination A alone does not achieve high precision in LULC information extraction; as Modules B and C are incorporated, the extraction of smaller features becomes clearer, effectively addressing the data loss and the difficulty of accurately extracting minority land cover types caused by class imbalance.
4.2. Comparison of SCIMF-Net with Other Convolutional Neural Networks
To fully validate the effectiveness of the proposed SCIMF-Net convolutional neural network, it was compared with classic networks such as U-Net++ [35], Res-FCN [36], U-Net [37], and SE-U-Net. As shown in Figure 11, under the same training conditions (150 epochs), SCIMF-Net achieved the highest training accuracy: 6.71% higher than Res-FCN, 5.58% higher than U-Net, 6.43% higher than SE-U-Net, and 4.34% higher than U-Net++. Additionally, when comparing the error between predicted and actual values (i.e., the loss), SCIMF-Net had the lowest loss, 0.287 lower than Res-FCN, 0.192 lower than U-Net, 0.239 lower than SE-U-Net, and 0.126 lower than U-Net++. Throughout the training process, SCIMF-Net outperformed the other four convolutional neural networks in both training accuracy and loss, demonstrating its effectiveness.
As shown in Table 4, SCIMF-Net was comprehensively compared with the other convolutional neural networks on the performance metrics for the prediction data. For the PA metric, SCIMF-Net demonstrated a clear advantage, outperforming Res-FCN, U-Net, SE-U-Net, and U-Net++ by 0.68%, 0.54%, 1.61%, and 3.39%, respectively. In the MPA comparison, SCIMF-Net also excelled, improving on the aforementioned networks by 2.96%, 4.51%, 2.37%, and 3.45%. To account for class imbalance, this study used the MIOU as a more comprehensive and accurate evaluation metric; SCIMF-Net surpassed Res-FCN, U-Net, SE-U-Net, and U-Net++ by 3.27%, 4.89%, 4.2%, and 5.68%, respectively. Across these three performance metrics, the proposed SCIMF-Net achieved better results in LULC information extraction under class imbalance, demonstrating its outstanding performance and advantages.
As shown in Figure 12, a comparison of SCIMF-Net with the four other convolutional neural networks in extracting the 17 secondary LULC categories reveals that SCIMF-Net achieves higher pixel accuracy in most tasks. Specifically, SCIMF-Net outperforms the other networks in extracting secondary categories such as rice paddy (11), shrubland (22), sparse forest (23), high-cover grassland (31), medium-cover grassland (32), low-cover grassland (33), channels (41), reservoirs (43), rural homesteads (52), and construction land (53). In a few categories it slightly underperforms specific networks: other forested land (24) is slightly lower than Res-FCN; dry land (12), forested land (21), and urban land (51) are slightly lower than U-Net; and wetlands (44) and bare rock (61) are slightly lower than U-Net++. However, the gap to the best result remains small, underscoring SCIMF-Net’s effectiveness in LULC extraction tasks. Additionally, all networks show relatively low extraction accuracy for reservoirs (43), likely because the complex features of this category are difficult to capture accurately. U-Net++ has the lowest extraction accuracy for the secondary LULC categories, possibly due to insufficient feature representation caused by class imbalance.
To compare the performance of the proposed network with other convolutional neural networks in extracting secondary LULC information, we visualized and compared the results of U-Net, U-Net++, Res-FCN, SE-U-Net, and the proposed SCIMF-Net, as shown in Figure 13 and Figure 14. In extracting rice paddy (11), U-Net and U-Net++ lost some edge contour information. For dry land (12), Res-FCN and SE-U-Net produced incomplete information. SE-U-Net showed misclassification in forested land (21) and shrubland (22). All networks performed well in extracting sparse forest (23). For other forested land (24), Res-FCN produced many misclassifications, and U-Net++ failed to extract complete contours. In high-cover grassland (31), SE-U-Net and U-Net had poor accuracy due to class imbalance. SE-U-Net showed many misclassifications in medium-cover grassland (32), and Res-FCN failed to extract low-cover grassland (33). For channels (41), Res-FCN and U-Net could not extract continuous linear information. In lakes (42), U-Net and SE-U-Net lost some edge information. U-Net showed partial misclassification in reservoirs (43). All networks performed poorly in wetland (44) extraction. U-Net missed some urban land (51) information. U-Net, SE-U-Net, and U-Net++ failed to extract complete contours of rural homesteads (52). Res-FCN, U-Net, and SE-U-Net had misclassifications in construction land (53). All networks performed well in extracting bare rock (61). These results indicate that traditional convolutional networks such as Res-FCN, U-Net, SE-U-Net, and U-Net++ perform poorly in extracting small target information and handle class imbalance in LULC information less effectively. In contrast, SCIMF-Net excels at extracting smaller LULC features and accurately handles class imbalance.
5. Discussion
The comparison of SCIMF-Net with U-Net, Res-FCN, U-Net++, and SE-U-Net shows the following. The original backbone network of U-Net mainly adopts the shallow structure of VGG16 [38]. Since remote sensing images often contain rich, delicate ground object information and many small targets, the shallow VGG16 backbone has difficulty extracting these fine-grained semantic features. The network mainly fuses remote sensing image features by feature concatenation, which makes it difficult to capture fine details of ground objects and loses deep target information in remote sensing images; it also performs poorly under sample imbalance and has difficulty extracting ground object categories with a small proportion. Res-FCN, as an improved fully convolutional network, can recover the category of each pixel from the abstract features of remote sensing images and realize pixel-level extraction of ground object categories. However, when extracting small target information, its receptive field does not cover enough contextual information, so the network extracts only local features and handles small targets poorly. Due to the limitations of the original FCN structure [39], Res-FCN also struggles to fully learn the features of underrepresented samples when dealing with sample imbalance, which affects the accuracy of ground object information extraction. Although SE-U-Net borrows the idea of the FPN on top of the U-Net structure and realizes multiscale feature fusion, its feature fusion is still insufficient for small targets: details of small target information are lost during fusion, and too little attention is paid to the spatial position of small targets, which affects extraction accuracy. Its loss function is also not optimized for sample imbalance, so the model learns too little from minority-category samples during training and has difficulty extracting the less common ground object categories. In U-Net++, the feature layers of the encoder and decoder networks are directly connected. This design simplifies the architecture but loses detailed information when fusing features of different scales, which in turn degrades small target extraction. U-Net++ likewise applies no special optimization for sample imbalance, so its extraction of less common object categories is poor, limiting the model’s performance on complex remote sensing images.
In this paper, the backbone network of SCIMF-Net is ResNeXt-101. As an architecture designed for image classification, the original ResNeXt-101 network cannot be directly applied to LULC information extraction from remote sensing images [40]. Therefore, this paper makes targeted modifications to it. First, leveraging the powerful multiscale feature extraction capability of ResNeXt-101, a grouped convolution strategy is adopted to effectively extract low-, medium-, and high-level features from remote sensing images. To further enhance the accuracy of feature extraction, the SE (squeeze-and-excitation) channel attention mechanism is integrated into the residual block, the basic structural unit of ResNeXt, to obtain the improved SE-ResNeXt residual block. This improvement enables the SE-ResNeXt-101 backbone to focus more on important small-scale feature channels while suppressing useless feature information, improving overall feature extraction. Compared with models that use shallow backbones, such as U-Net, Res-FCN, U-Net++, and SE-U-Net, the deeper SE-ResNeXt-101 architecture significantly strengthens the extraction of LULC features in remote sensing images. In addition, the four stages of SE-ResNeXt-101 extract multiscale LULC features, allowing SCIMF-Net to better adapt to land cover objects of different sizes and improving the model’s overall performance and adaptability. The PMFF module designed in this paper takes the mid- and high-level features extracted by the SE-ResNeXt-101 backbone and strengthens the coupling between spatial context extraction and multiscale features, effectively improving the network’s understanding of local areas, enhancing the coherence of global information, and fully utilizing the semantics and details of various object types by integrating features of different scales. Compared with the structures of U-Net, Res-FCN, U-Net++, and SE-U-Net, the PMFF module gives SCIMF-Net a deeper understanding of the spatial relationships and variations of small-scale object information in remote sensing images, helping the network obtain both coarse-grained and fine-grained feature information and improving LULC extraction accuracy at different scales. At the same time, the multimodule parallel computing design avoids the parameter redundancy and repeated computation of a single-module structure and improves computational efficiency. Regarding the imbalance of ground object categories, improving the loss function is one of the effective means of addressing class imbalance when extracting LULC information from remote sensing images. Most traditional LULC extraction methods train with a single loss function, of which there are three main types. The first is the cross-entropy loss function, which does not consider the imbalance between the numbers of LULC categories; it has difficulty learning the characteristics of underrepresented categories and easily leads to overfitting after training. The second is the focal loss function, which alleviates LULC imbalance by modifying the cross-entropy loss; however, its training process is unstable, and it handles large within-class differences poorly. The third is the Dice loss function.
The Dice loss function is suitable for the LULC imbalance problem, but it can cause unstable gradients during training, hampering convergence and leaving performance on some ground object categories insufficiently optimized. To effectively address the unbalanced distribution of LULC categories, and in view of the shortcomings of the single loss functions above, this paper adopts a weighted joint loss function that overcomes the performance limitations of any single loss. Compared with the single loss functions used by U-Net, Res-FCN, U-Net++, and SE-U-Net, the weighted joint loss reduces the overfitting risk of a single loss, adjusts the weights of underrepresented ground object categories, enhances feature learning for these categories, and improves the extraction of the minority categories that suffer under class imbalance. The ablation experiments fully demonstrate the effectiveness of the three modules above and their positive contributions to the performance of the SCIMF-Net model, effectively improving the overall feature extraction performance of SCIMF-Net.
The study area of this experiment is the central Yunnan urban agglomeration in Yunnan Province, China, covering about 114,600 square kilometers. The remote sensing datasets are mainly from Landsat 5 TM (2000 and 2010) and Landsat 8 OLI (2020), with a spatial resolution of 30 m, placing them in the medium-resolution category. SCIMF-Net, as an improved convolutional neural network model, achieves the highest LULC information extraction accuracy compared with U-Net, Res-FCN, U-Net++, and SE-U-Net. However, SCIMF-Net is a deep convolutional neural network with greater depth and width than these models. Although this design yields the highest accuracy in our experiments, it also brings higher computing resource requirements and a larger storage footprint, so SCIMF-Net has no advantage in computational cost. In the future, to improve the model’s practicality, in-depth optimization is needed, focusing on a lightweight structure that reduces computational complexity and storage requirements.
As high-resolution remote sensing image acquisition becomes increasingly mature and convenient [41], the limitations of the global modeling capabilities of traditional CNN models in processing such images have become increasingly prominent. Breaking through this bottleneck requires methods that better account for the unique characteristics of high-resolution remote sensing images. Future work will therefore focus on developing and applying more advanced network architectures, aiming at significant performance improvements in high-resolution remote sensing image processing. At the same time, with the continuous development of deep learning and artificial intelligence, the RS-Mamba model [42] has shown significant advantages over traditional methods in computational efficiency and LULC information extraction accuracy. With its linear complexity and excellent global modeling capability, the model is highly applicable to LULC information extraction from large-scale remote sensing images. This not only greatly broadens the application scope of remote sensing technology but also points the way for future research. Future studies should further explore and optimize the application potential of the remote sensing Mamba model in large-scale, high-precision LULC information extraction, so that it can play a greater role in key areas such as land use monitoring and continuously promote the integrated development of remote sensing technology and deep learning.