1. Introduction
With recent advances in satellite launch technology, satellite remote sensing has become cheaper while retaining its inherent advantages of wide coverage and fast response, and it is therefore applied in an increasingly broad range of areas. Detecting spatial information changes in remote sensing images requires classifying and interpreting the images, and remote sensing image segmentation plays a critical role in this task due to the low spatial resolution of satellite images [1]. Over the past five years, remote sensing image segmentation has been applied in forestry [2], hydrology [3], environmental protection [4], and meteorology [5,6,7,8]. These studies stress that segmentation performance strongly influences the final interpretation results [9]. It is therefore particularly important to focus on remote sensing image segmentation methods.
Traditional image segmentation methods can be divided into threshold-based, edge-based, region-based, and graph-based methods [7]. Threshold-based methods usually set thresholds on the results of band operations or common feature indexes in the image and then assign each pixel to the appropriate category [10]. Edge-based methods detect changes in image grayscale values and feature indexes, which manifest discontinuities in the local features of the image and thus give rise to edges between different regions [11]. Within the framework of mathematical morphology, the watershed transformation is a method often used for edge segmentation [12]. This algorithm treats the two-dimensional image as elevation data and determines region boundaries by simulating a flooding process. Region-based segmentation algorithms, in contrast, exploit the similarity within regions to distinguish different regions. The region growing method [13] starts by selecting seed pixels, joins neighboring pixels according to a similarity criterion, and iterates until the entire region is formed. The region splitting and merging algorithm first partitions the image into multiple sub-regions and then merges them according to the properties of the sub-regions [14,15]. Graph-based segmentation maps the image onto a weighted, undirected graph in which the weight on each edge indicates the difference between pixels; segmentation is achieved by cutting and removing specific edges so as to maximize similarity within subgraphs and minimize similarity between subgraphs [16,17]. These methods can also be combined, for example by extracting initial segments with an edge-based algorithm and merging similar segments with a region-based algorithm [18], thereby taking into account both the boundary information between regions and the spatial information within them.
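As an illustration of the region-based family, the region growing procedure described above can be sketched in a few lines of NumPy. The 4-connectivity and the grey-value tolerance criterion below are illustrative assumptions, not the exact criteria of [13]:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol=10.0):
    """Grow a region from `seed` by adding 4-connected neighbours whose
    grey value differs from the seed value by at most `tol`.
    (Illustrative similarity criterion; real methods may compare
    against the running region mean instead.)"""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(image[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                if abs(float(image[nr, nc]) - seed_val) <= tol:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```

Iteration stops when no unvisited neighbour satisfies the similarity criterion, at which point the boolean mask delimits one grown region.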
In traditional remote sensing image segmentation, standardized spectral indicators, such as the Normalized Difference Water Index (NDWI), the Normalized Difference Vegetation Index (NDVI), and the Normalized Difference Built-up Index (NDBI), are usually used as feature data, with different indicator combinations and threshold ranges for different detection targets. However, remote sensing images have multispectral channels, rich data, and complex backgrounds, and the effectiveness of traditional segmentation methods is limited because they do not fully exploit these features to extract further remote sensing information.
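For instance, thresholding the NDWI is a common way to separate water from non-water pixels. The sketch below assumes the green and near-infrared bands are given as NumPy reflectance arrays and uses an illustrative threshold of 0; in practice the threshold is tuned per scene and sensor:

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.0, eps=1e-8):
    """Threshold-based water segmentation using
    NDWI = (Green - NIR) / (Green + NIR).
    `threshold` is a tunable hyperparameter, not a universal constant."""
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    ndwi = (green - nir) / (green + nir + eps)  # eps avoids division by zero
    return ndwi > threshold
```

Water reflects strongly in the green band and absorbs in the near-infrared, so water pixels yield positive NDWI values and vegetation or soil yields negative ones.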
Semantic segmentation methods based on deep learning classify images pixel by pixel and achieve better performance in natural image segmentation. The basic framework of many semantic segmentation studies draws on the work of Long et al., who proposed the fully convolutional network (FCN) [19], a framework that adapts classification architectures such as AlexNet, VGG-16, and GoogLeNet, can be trained end-to-end on input images of any size, and efficiently produces dense predictions for per-pixel tasks such as semantic segmentation. The Deeplab series based on the FCN, introduced by Chen et al., tackles the problems of encoding multi-scale information and sharpening segmented output through pooling techniques or filters. Deeplab-v1 improved segmentation localization accuracy by adding a fully connected conditional random field (CRF) [20], but it was computationally expensive until Deeplab-v2 adopted atrous convolution for sampling and used the residual network ResNet as the downsampling structure to increase the model's fitting ability [21]. Deeplab-v3 further developed the use of atrous convolution and improved the atrous spatial pyramid pooling (ASPP) module to enhance the ability to capture context [22]. Integrating the advantages of its predecessors, Deeplab-v3+ applies Xception as a new backbone network, making overall predictions from multiple scales of the same image and improving feature resolution [23]. Another multi-scale, pyramid-based model, PSPNet [24], added a pyramid pooling module to the FCN framework to improve segmentation performance in contextually complex scenes and for small targets, as well as the convergence speed of the model. In addition to FCN-based models, the U-Net series is an encoder–decoder architecture for semantic segmentation. U-Net [25] addressed the problem of training on small datasets with its U-shaped encoding–decoding structure and spawned many models with good segmentation performance, such as UNet++ and Attention U-Net. SegNet follows the U-shaped structure and reuses the encoder's max-pooling indices for upsampling in the decoder, which reduces the number of parameters for end-to-end training and can easily be merged into other U-shaped structures [26].
Deep-learning-based semantic segmentation is well suited to remote sensing image segmentation tasks, with their large data volumes and complex backgrounds. However, compared with natural images, remote sensing images are larger and the targets to be segmented occupy a smaller proportion of each image, which leads to a foreground–background imbalance. In addition, the scale difference between target categories in remote sensing images is huge, which leads to a foreground–foreground inter-category imbalance. These two imbalances bias a deep neural network toward segmenting the categories with more pixels, weakening its segmentation ability on categories with few pixels; this ultimately degrades the segmentation accuracy of the model and causes failures in the interpretation of remote sensing images.
To address this problem, some related research already exists in the field of remote sensing image segmentation. From the perspective of sample resampling, a combined sampling method was proposed to solve the class imbalance problem of feature segmentation in the Tibetan Plateau region [27]. The Deeplab-v3+ model was put forward, which encodes multi-scale contextual information by atrous convolution to improve the segmentation of unbalanced data [6]. A new variant of the Dice loss named Tanimoto was presented, which speeds up training convergence and performs well on severely unbalanced aerial datasets [28]. Audrey et al. (2020) demonstrated that tree species classification with parametric algorithms, combining Canopy Height Model (CHM) data, spectral data, and height data fused with non-parametric classification, is applicable to unbalanced binary classification. A novel synthetic minority oversampling technique-based rotation forest algorithm was also proposed for the classification of imbalanced hyperspectral image data [29].
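A minimal single-term sketch of such a Tanimoto overlap loss is given below; the exact (dual) formulation in [28] may differ, and the formula `T = Σpt / (Σp² + Σt² − Σpt)` used here is one common variant of the Tanimoto coefficient:

```python
import numpy as np

def tanimoto_loss(pred, target, eps=1e-8):
    """Tanimoto loss, a Dice-like overlap loss:
    T = sum(p*t) / (sum(p^2) + sum(t^2) - sum(p*t)),
    loss = 1 - T.  Sketch only; the cited paper's dual
    formulation averages this with its complement."""
    inter = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2) - inter
    return 1.0 - (inter + eps) / (denom + eps)
```

Like the Dice loss, this measures region overlap rather than per-pixel error, so a tiny minority class contributes as much to the loss as a large one, which is why such losses behave well on severely unbalanced data.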
In the study of natural images, extreme imbalance in the sample data arises in a variety of tasks, including target detection, image classification, and instance segmentation [30,31,32]. Ref. [33] showed that classification performance under class imbalance deteriorates as the ratio between the majority and minority classes increases. To address this problem, common deep learning methods can be grouped into three categories: class rebalancing, information enhancement, and module improvement [30]. Re-weighting methods rebalance the categories by adjusting the loss values of different categories during training [34]. Ref. [35] applied a two-stage training model in which the weights of the more numerous categories were reduced in the second stage based on sample gradient changes. Ref. [36] trained an a priori model in the first stage and reweighted the whole model in the second stage using the Kullback–Leibler divergence. The two-stage reweighting approach offers more room for adjustment, but it is slower and less convenient for model deployment and application. The balanced meta-softmax [37] optimizes classification performance by learning the optimal sample distribution parameters on a balanced metadata set. The label distribution disentangling (LADE) method introduces a label distribution separation loss that separates a balanced distribution from an unbalanced dataset, allowing the model to adapt to an arbitrary test class distribution when the test label frequency is available [38]. Meta-Weight-Net [38] designs a functional mapping from training losses to sample weights, followed by multiple iterations of weight computation and classifier updates; guided by a small amount of unbiased metadata, the parameters of the weighting function are fine-tuned and updated in parallel with the learning of the classifier.
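The core re-weighting idea shared by these methods, scaling each sample's loss by a per-class weight, can be sketched for cross entropy as follows. This is an illustrative static-weight NumPy version of the general idea, not any specific cited algorithm:

```python
import numpy as np

def weighted_ce(logits, labels, class_weights):
    """Cross entropy with per-class weights w_c applied by true label,
    so minority classes contribute larger loss values.
    logits: (..., C) raw scores; labels: integer class indices."""
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    n = labels.size
    p_true = probs.reshape(-1, probs.shape[-1])[np.arange(n), labels.ravel()]
    w = class_weights[labels.ravel()]
    # weighted mean of -log p(true class)
    return float((-w * np.log(p_true)).sum() / w.sum())
```

Raising the weight of a minority class amplifies the gradient its pixels produce, pushing the network to fit them better at the cost of some majority-class accuracy; choosing those weights well is exactly what the methods above differ on.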
Notwithstanding the effectiveness of these methodologies, which rely on existing balanced datasets, the imbalance of remote sensing images is inherent in every image, making it difficult to build a suitable balanced dataset. The Dual Focal Loss (DFL) function modified the loss scaling of the Focal Loss to improve classification accuracy on the unbalanced classes of a dataset by mitigating the vanishing-gradient problem [39]. Ref. [40] proposed a one-stage class-balanced reweighting method based on the effective sample space; combined with the Focal loss [41] and the cross-entropy (CE) loss, this one-stage method achieved good results on extremely unbalanced image classification tasks without requiring an a priori balanced dataset. However, although existing dynamic weighting algorithms for the extreme imbalance problem improve the segmentation of very small classes, they also reduce the overall segmentation accuracy [37]. In addition, the effective sample space has not yet been defined and studied for semantic segmentation tasks, and more applicable computation methods for the relevant hyperparameters have not yet been proposed.
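For image classification, effective-sample-based weights are commonly computed from the effective number E_n = (1 − β^n)/(1 − β) and set proportional to 1/E_n. The sketch below applies that formula to per-class pixel counts as an illustration of the idea this line of work builds on; the value of β and the normalization to the number of classes are assumptions, not the exact computation of [40]:

```python
import numpy as np

def class_balanced_weights(pixel_counts, beta=0.999):
    """Weights from the effective number of samples:
    E_n = (1 - beta^n) / (1 - beta),  w_c proportional to 1 / E_n,
    normalised so the weights sum to the number of classes.
    `pixel_counts` are per-class pixel counts (illustrative use)."""
    counts = np.asarray(pixel_counts, dtype=np.float64)
    effective = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective
    return weights * len(counts) / weights.sum()
```

Because E_n saturates as n grows, the weight of a huge majority class stops shrinking once its samples overlap heavily, which is what distinguishes this scheme from plain inverse-frequency weighting.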
In this paper, for the semantic segmentation of remote sensing images, the majority and minority categories are divided by studying the effective sample space of the dataset, and a Dynamic Effective Class Balance (DECB) weighting method based on the number of effective samples is proposed for the first time. For validation, a publicly available LULC remote sensing image dataset, covering the most popular category of remote sensing segmentation research, and a self-constructed forest fire burning area dataset were used. The experimental results demonstrate that the DECB method is effective in remote sensing image segmentation and highlights very small classes without sacrificing the overall segmentation performance.
The main parts of this paper are structured as follows:
Section 2 introduces the datasets used in this paper, including the self-built forest fire burning area dataset and the unbalanced datasets constructed from the publicly available land-cover segmentation dataset.
Section 3 proposes a method for calculating the number of effective samples in semantic segmentation and a DECB weighting algorithm.
Section 4 applies the algorithm to LULC and burning area segmentation experiments and analyses the experimental results.
Section 5 draws the conclusion.
5. Conclusions
Image segmentation results based on deep learning are strongly affected by the highly unbalanced distribution of categories in remote sensing datasets. To address this problem, this paper makes the following contributions: Firstly, the corresponding datasets are established, including a tri-class, extremely unbalanced forest fire burning area segmentation dataset and two highly unbalanced segmentation datasets derived from a publicly available dataset. Secondly, a method for computing effective samples in the semantic segmentation task and a dynamic effective class balancing weighting method are proposed to solve the class imbalance problem in multi-category semantic segmentation. Finally, the effectiveness and robustness of the method are verified experimentally.
The results show that the DECB method can improve minority class segmentation in the semantic segmentation task when combined with the Focal loss and CE loss in a U-Net architecture with VGG and ResNet-50, respectively, as encoders. On the publicly available LoveDA-rural and LoveDA-r-road datasets, the mean IoU of very small class segmentation increased by approximately 1%, and the overall mean IoU also increased due to the change in class balance. On the forest fire burning area dataset, the mean IoU for forest fire pixel segmentation increased by up to about 4%, and the recall increased by approximately 20%, which is particularly advantageous in the forest fire burning area segmentation task. Thus, the DECB method proposed in this paper can effectively improve the segmentation of the smallest classes without sacrificing overall accuracy.
However, some issues still need to be addressed in further research. The quantitative imbalance between categories in a single image or a single batch is not exactly consistent with that of the dataset as a whole, which is the fundamental reason why the data in the sample space cannot be distributed as evenly as in the ideal case.