1. Introduction
Forests play a vital role in the land ecosystem of the Earth. They are indispensable for conserving biodiversity, protecting watersheds, capturing carbon, mitigating climate change effects [1,2], maintaining ecological balance, regulating rainfall patterns, and ensuring the stability of large-scale climate systems [3,4]. As a result, the timely and precise monitoring and mapping of forest cover has emerged as a vital aspect of sustainable forest management and the monitoring of ecosystem transformations [5].
Traditionally, the monitoring and mapping of forest cover has primarily relied on field research and photo-interpretation techniques. However, these methods are limited by the extensive manpower required. With the advancements in remote sensing (RS) technology, the acquisition of large-scale, high-resolution forest imagery data has become possible without the need for physical contact and without causing harm to the forest environment. Taking advantage of RS imagery, numerous studies have proposed various methods for forest cover mapping, including decision trees [6], regression trees [7], maximum likelihood classifiers [8], random forest classification algorithms [9,10], support vector machines, spatio-temporal Markov random-field super-resolution mapping [11], and multi-scale spectral–spatial–temporal super-resolution mapping [12].
Recently, due to the growing prevalence of deep Convolutional Neural Networks (CNNs) [13] and semantic segmentation [14,15,16], there has been a notable shift in the research community towards utilising these techniques for forest cover mapping with RS imagery. CNNs have emerged as powerful tools for analysing two-dimensional images, employing their multi-layered convolution operations to effectively capture low-level spatial patterns (such as edges, textures, and shapes) and extract high-level semantic information. Meanwhile, semantic segmentation techniques enable the precise identification and extraction of different objects/regions in an image by classifying each image pixel into a specific semantic category, achieving pixel-level image segmentation. In the context of forest cover mapping, several existing methods have demonstrated the effectiveness of using semantic segmentation techniques. Bragagnolo et al. [17] integrated an attention block into the basic U-Net network to segment the forest area using satellite imagery from South America. Flood et al. [18] also proposed a U-Net-based network and achieved promising results in mapping the presence or absence of trees and shrubs in Queensland, Australia. Isaienkov et al. [19] directly employed the baseline U-Net model combined with Sentinel-2 satellite data to detect changes in Ukrainian forests. However, all these methods rely on fully supervised learning for semantic segmentation, which necessitates a substantial amount of labelled pixel data, resulting in a significant labelling expense. Semi-supervised learning [20,21,22] has emerged as a promising approach to address this labelling cost. It involves training models using a combination of limited labelled data and a substantial amount of unlabelled data, which reduces the need for manual annotations while still improving the performance of the model. Several research studies [23,24] have explored the application of semi-supervised learning in semantic segmentation for land-cover-mapping tasks in RS. While these studies have assessed the segmentation of forests to some extent, their focus has predominantly been on forests situated in urban or semi-natural areas, limiting their performance in densely forested natural areas. Moreover, there are unique challenges associated with utilising satellite RS imagery specifically for forest cover mapping:
Challenge 1: As illustrated in Figure 1a, satellite remote sensing (RS) forest images often face problems such as variations in scene illumination and atmospheric interference during the image acquisition process. These factors can lead to colour deviations and distortions, resulting in poor colour fidelity and low contrast. Therefore, it becomes essential to employ image enhancement techniques to improve visualisation and reveal more details. This enhancement facilitates the ability of CNNs to effectively capture spatial patterns.
Challenge 2: Due to the high density of natural forest cover and the similar reflectance characteristics between forest targets and other non-forest targets, e.g., grass and the shadow of vegetation, the boundaries between forest and non-forest areas often become unclear. As a result, it becomes challenging to accurately distinguish and delineate the details and edges of the regions of interest, as depicted in Figure 2a.
Challenge 3: For unknown forest distributions, there are two scenarios: imbalanced (as illustrated in Figure 2) or balanced datasets. Current methods face challenges in effectively handling datasets with different distributions, resulting in poor model generalisation.
In this study, we have undertaken a pioneering endeavour to integrate semi-supervised learning techniques into forest cover mapping using satellite RS imagery. We propose a novel semi-supervised semantic segmentation network called Semi-FCMNet, designed to effectively tackle the associated challenges in this task. To tackle challenge 1, which encompasses image distortion issues in RS images, we employ the concept of multi-level perturbations. Perturbations are designed at different stages, namely, at the input level, feature level, and model level. At the input level, perturbations utilising mixed augmentation techniques are employed to enhance the representation of forest features within the image. The feature-level and model-level perturbations facilitate model learning and capture valuable information during training. Importantly, these perturbations effectively counteract the overfitting of the model to noise. For challenge 2, we combine the auxiliary teacher module with the Test-Time Augmentation (TTA) approach to integrate the generated pseudo-labels through voting and multi-scale fusion, enhancing the reliability and clarity of edge information. To address challenge 3, we introduce an adaptive loss function that automatically focuses on under-learned classes and adjusts the attention towards labels generated by both the student and teacher models. This approach enables our model to achieve excellent performance on both balanced and imbalanced datasets. The adaptive loss function effectively addresses the issue of insufficient learning in certain classes and improves the capability of the model to handle diverse data distributions. Furthermore, we adopt a progressive learning approach and design a data augmentation strategy from easy to difficult, employing different intensity levels of data augmentation for models at different stages. The code is publicly available at
https://github.com/baizegugugu/Semi-FCMNet.
The primary contributions of this paper are summarised as follows:
We have designed the multi-level perturbation (MP) module, including input-, feature- and model-level perturbations at different module stages. The proposed approach incorporates perturbations at the input stage to enhance forest representation features by using mixed augmentation. Additionally, the auxiliary teacher module introduces perturbations at both the feature and model levels, allowing the model to concentrate on feature disparities and proficiently learn forest characteristics while effectively mitigating the overfitting problem to noise.
By integrating auxiliary teachers with the student model, the basic self-training method was enhanced. To generate more stable and reliable pseudo-labels during the pseudo-labelling phase, we introduced a novel ensemble voting (EV) module, smoothing the decision-making process for challenging boundary regions. This module leverages a combination of multiple models and adopts a strategy based on TTA and multi-model voting.
We have developed a simple yet effective adaptive loss (AL) that enables the model to adapt to both balanced and imbalanced data distributions while also increasing its focus on labels generated by the teacher. By incorporating AL into the training process, our model demonstrates robust performance across different data scenarios.
This paper is organised as follows. Section 2 reviews semi-supervised semantic segmentation, and Section 3 provides a detailed description of the proposed framework. Section 4.1 presents detailed information on the data distribution of the two datasets we used (Atlantic Forest and Amazon Forest), while Section 4.2 and Section 4.3 introduce the relevant parameters set in our experiments and the metrics used to verify the experimental results, respectively. Section 4.4 compares and analyses our methods with the SOTA methods on the two datasets, while Section 4.5 presents the ablation experimental results for our method to validate its effectiveness. Finally, Section 5 outlines the limitations of our method, future research directions, and application prospects.
3. Method
3.1. Problem Definition
Semi-supervised semantic segmentation is a method of semantic segmentation that uses labelled and unlabelled data. Compared to traditional fully supervised semantic segmentation methods, it does not require a large amount of labelled data to train the model, making computation more economical and efficient. The main idea of semi-supervised semantic segmentation is to enhance the generalisation ability of the model by utilising unlabelled data. Specifically, semi-supervised semantic segmentation seeks to generalise from a combined dataset consisting of pixel-wise labelled images $\mathcal{D}_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}$ and unlabelled images $\mathcal{D}_u = \{x_i^u\}_{i=1}^{N_u}$, where, typically, $N_u \gg N_l$. In the majority of studies, the overall optimisation objective is formulated as
$$\mathcal{L} = \mathcal{L}_s + \lambda \mathcal{L}_u,$$
where $\lambda$ serves as a tradeoff between labelled and unlabelled data. The parameter $\lambda$ can either be a fixed value or be scheduled during training. The unsupervised loss $\mathcal{L}_u$ is a crucial aspect that distinguishes various semi-supervised methods, whereas the supervised loss $\mathcal{L}_s$ typically refers to the cross-entropy loss between predictions and manually annotated masks.
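As a minimal illustration of the weighted objective described above (a supervised loss plus a tradeoff weight times an unsupervised loss), the following sketch combines two scalar losses and ramps the weight up linearly. The function names, the linear ramp, and the `ramp_steps` parameter are assumptions made for illustration only; the text states merely that the weight may be fixed or scheduled.

```python
def ramp_up_weight(step: int, ramp_steps: int, max_weight: float = 1.0) -> float:
    """Hypothetical schedule: grow the tradeoff weight linearly from 0 to max_weight."""
    if ramp_steps <= 0:
        return max_weight  # degenerate case: behave like a fixed weight
    return max_weight * min(1.0, step / ramp_steps)


def total_loss(sup_loss: float, unsup_loss: float, lam: float) -> float:
    """Overall objective: supervised loss plus lam times the unsupervised loss."""
    return sup_loss + lam * unsup_loss
```

With `ramp_steps=100`, for example, the unsupervised term contributes half of its value at step 50 and its full value from step 100 onwards.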
3.2. Auxiliary Mean Teachers and Student Models
Although many more advanced models and methods have emerged, classical methods such as self-training and mean teachers can also perform well with improvements. To further improve the performance of the model based on the self-training method, we incorporate an improved mean-teacher mechanism.
Figure 3 shows the architecture of our model. The model follows the classical encoder–decoder design, using ResNet-101 as the encoder for better pixel information extraction and restoration and DeeplabV3+ as the decoder. All auxiliary teachers and students share the same structure, and both auxiliary teachers receive exponential moving average (EMA) transfers of parameters from different epochs of the student, as shown in Figure 4, i.e.,
$$\theta_{t_k} \leftarrow \alpha\,\theta_{t_k} + (1-\alpha)\,\theta_s, \quad k \in \{1, 2\},$$
where $\alpha \in [0, 1]$ is the EMA decay rate, $\theta_s$ denotes the parameters of the student, and $\theta_{t_k}$ denotes the parameters of the $k$th auxiliary teacher. For the training of teacher models, we update the parameters of only one of the two teachers at each training epoch.
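The EMA transfer from the student to the two auxiliary teachers can be sketched as follows. This is a minimal sketch under stated assumptions: parameters are plain dictionaries of floats rather than network tensors, the decay value 0.99 is a placeholder, and choosing the teacher by epoch parity is one hypothetical way of updating only one teacher per epoch, since the text does not specify the selection rule.

```python
from typing import Dict, List

Params = Dict[str, float]


def ema_update(teacher: Params, student: Params, alpha: float) -> None:
    """In-place EMA: teacher <- alpha * teacher + (1 - alpha) * student."""
    for name, student_value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student_value


def update_one_teacher(teachers: List[Params], student: Params,
                       epoch: int, alpha: float = 0.99) -> int:
    """Update a single auxiliary teacher per epoch and return its index.

    Assumption: teachers take turns according to the epoch number.
    """
    index = epoch % len(teachers)
    ema_update(teachers[index], student, alpha)
    return index
```

Because each teacher is refreshed at different epochs, the two teachers hold different running averages of the student, which is what makes their predictions usable as separate votes.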
3.3. Training Strategy: Self-Training (ST)
Currently, the perturbation method based on consistency regularisation is widely considered the most effective approach to semi-supervised learning. This method involves perturbing the image during data augmentation and improving the performance of the model by constraining the similarity between its final predicted outputs. These perturbations mostly focus on perturbing the image representation, such as adding noise, random dropout, cutout, and CutMix. However, the unprocessed RGB colour features of the forest satellite RS dataset are not obvious, and the data distribution is imbalanced. Random perturbations at the image level make it difficult for the classes with few samples to be fully learned, while the classes with many samples are constrained by the loss function, making it difficult for the model to fully learn from those samples, resulting in poor model performance. At the same time, the perturbation method based on consistency regularisation requires manually adjusting the weights of unsupervised loss and supervised loss, and it is difficult to adjust the proportion for different weights, leading to a further decline in model performance.
Therefore, in order to fully utilise the information in the dataset and reduce the settings of hyperparameters, we primarily adopt a method based on self-training, as shown in Figure 5. Firstly, we train the teacher model $t$ on the labelled dataset $\mathcal{D}_l$, and then $t$ is used to assign pseudo-labels to the remaining unlabelled samples in $\mathcal{D}_u$. Finally, the pseudo-labels are used as the labels for the unlabelled images, and the student model $s$ is trained on the entire dataset $\mathcal{D}_l \cup \hat{\mathcal{D}}_u$, where $\hat{\mathcal{D}}_u$ denotes the pseudo-labelled version of $\mathcal{D}_u$. During the training process, auxiliary teachers are introduced to more stably evaluate and predict unlabelled images.
The pseudocode of our ST framework is shown in Algorithm 1.
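Algorithm 1 ST with perturbation
Require: Labelled training set $\mathcal{D}_l$; unlabelled training set $\mathcal{D}_u$; strong/VAT/weak augmentations $\mathcal{A}_s$/$\mathcal{A}_{vat}$/$\mathcal{A}_w$; teacher/student models $t$/$s$
Ensure: Student model $s$
1: for each minibatch of $\mathcal{D}_l$ do
2:   for each sample $k$ in the minibatch do
3:     apply the weak augmentation $\mathcal{A}_w$ to the image and its mask
4:     predict the mask under the VAT feature perturbation $\mathcal{A}_{vat}$
5:     update the teacher $t$ to minimise the cross-entropy and focal losses of the predictions against the annotated masks
6:     update the auxiliary teachers with EMA of $s$
7:   end for
8: end for
9: $\hat{\mathcal{D}}_u \leftarrow$ Label($\mathcal{D}_u$)
10: for each minibatch of $\mathcal{D}_l \cup \hat{\mathcal{D}}_u$ do
11:   for each sample $k$ in the minibatch do
12:     apply the strong and weak augmentations $\mathcal{A}_s$ and $\mathcal{A}_w$ to the image and its (pseudo-)label
13:     predict the mask with the student $s$ under the VAT feature perturbation $\mathcal{A}_{vat}$
14:     predict the mask with the auxiliary teachers
15:     update $s$ to minimise the cross-entropy and focal losses of the student predictions against the (pseudo-)labels
16:     update $s$ to minimise the cross-entropy and focal losses of the student predictions against the auxiliary-teacher predictions
17:     update the auxiliary teachers with EMA of $s$
18:   end for
19: end for
20: return $s$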
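The three stages of self-training described above (train a teacher on labelled data, pseudo-label the unlabelled data, train a student on the union) can be illustrated with a deliberately tiny stand-in for the segmentation network. In this sketch, which is not the authors' implementation, a "model" is a single threshold on a scalar feature and "training" places it midway between the two class means; all names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (scalar feature, label in {0, 1})


def fit_threshold(data: List[Sample]) -> float:
    """Toy 'training': threshold halfway between the two class means.

    Assumes both classes are present in `data`.
    """
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict(threshold: float, x: float) -> int:
    """Toy 'inference': class 1 if the feature exceeds the threshold."""
    return 1 if x > threshold else 0


def self_train(labelled: List[Sample], unlabelled: List[float]) -> Tuple[float, float]:
    """Return (teacher, student) thresholds after one self-training round."""
    teacher = fit_threshold(labelled)                           # stage 1: supervised
    pseudo = [(x, predict(teacher, x)) for x in unlabelled]     # stage 2: pseudo-labelling
    student = fit_threshold(labelled + pseudo)                  # stage 3: re-training
    return teacher, student
```

The student boundary shifts towards the distribution of the unlabelled samples, which is the intended benefit; it also inherits any teacher mistakes, which is the overfitting risk that the perturbations and auxiliary teachers are meant to reduce.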
In this training strategy, strong augmentation refers to data transformations that alter the colour, contrast, and other properties of the image, such as colorJitter, random greyscale, blur, etc. On the other hand, weak augmentation pertains to transformations like resizing, cropping, and horizontal flipping that do not modify the main features of the original image. It is important to note that we employ weak augmentation during the supervised phase, whereas in the unsupervised phase, we adopt a combination of strong and weak augmentations. Both of these types of augmentation fall under input-level image perturbation and, together with subsequent transformations like VAT (feature-level perturbation), constitute multi-level perturbation.
3.4. Multi-Level Perturbation (MP)
In the process of model learning, the most basic self-training method will overfit the errors during iteration and reduce the performance of the student model. To better capture the intrinsic information of forest images and mitigate the risk of overfitting incorrect labels, we propose a multi-level perturbation strategy including input-level image perturbation, feature-level perturbation, and model-level perturbation.
3.4.1. Input-Level Image Perturbation
Given the limited prominence of colour features and the difficulty in learning image features, we opted for a mixed-image augmentation as input-level perturbation for fully supervised and unsupervised learning, which allows the model to prioritise the overall image rather than focusing solely on partial regions. Specifically (as shown in Figure 3), in the fully supervised stage, we applied weak augmentations, including resize, crop, and horizontal flip transformations, to the input images. In the unsupervised stage, we employed strong augmentations, including colorJitter, random greyscale, and blur, as well as random cutout and contrast/colour-filtering techniques, on the input images.
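To make the distinction between weak and strong input-level perturbations concrete, the sketch below implements one example of each on a single-channel image stored as nested lists: a horizontal flip (weak) and a cutout filled with random values (strong). It is an illustrative sketch only. The `IGNORE` label value, the function names, and the explicit `top`/`left`/`size` arguments are assumptions; marking cutout pixels so that the loss can skip them follows the experimental settings reported later in the paper.

```python
import random
from typing import List, Tuple

Image = List[List[float]]
Mask = List[List[int]]

IGNORE = 255  # hypothetical label value for pixels excluded from the loss


def horizontal_flip(img: Image, mask: Mask) -> Tuple[Image, Mask]:
    """Weak augmentation: mirror image and mask together."""
    return [row[::-1] for row in img], [row[::-1] for row in mask]


def cutout(img: Image, mask: Mask, top: int, left: int, size: int,
           rng: random.Random) -> Tuple[Image, Mask]:
    """Strong augmentation: fill a square with random values, mark it as ignored."""
    out_img = [row[:] for row in img]
    out_mask = [row[:] for row in mask]
    for r in range(top, min(top + size, len(img))):
        for c in range(left, min(left + size, len(img[0]))):
            out_img[r][c] = rng.random()
            out_mask[r][c] = IGNORE
    return out_img, out_mask
```

Both functions return copies, so the same source image can be passed through a weak and a strong branch independently.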
3.4.2. Feature-Level Perturbation VAT
Feature perturbation involves the creation of adversarial perturbations that pose a challenge by intentionally disrupting the cluster or low-density assumption. This is achieved by transforming the image features, computed from the model encoder, towards the classification boundaries within the feature space. One effective approach to generating such adversarial feature noise is through virtual adversarial training (VAT), which optimises a perturbation vector to maximise the divergence between correct and adversarial classifications. However, current methods estimate the adversarial noise by using the same single network where the consistency loss is applied. Thus, we suggest estimating the adversarial noise using the more accurate teachers and then applying this estimated noise to the feature of the student model, which we call VAT feature perturbation. In particular, we used the VAT feature enhancement method in both the fully supervised and semi-supervised stages and achieved significant improvements. VAT is applied to the student model output, i.e.,
$$\hat{y}^{s} = g_s^{d}\left(g_s^{e}(x) + r_{adv}\right),$$
where $\hat{y}^{s}$ is the prediction result of the student model for each pixel, and $g_s^{e}$ and $g_s^{d}$ are, respectively, the encoder and decoder of the student model. The adversarial feature perturbation $r_{adv}$ is estimated from the response of the ensemble of teacher models using
$$r_{adv} = \arg\max_{\|r\|_2 \le \epsilon} d\left(g_t^{d}(z_t),\, g_t^{d}(z_t + r)\right),$$
where $z_t = g_t^{e}(x)$ is the feature computed by the teacher encoder, $g_t^{e}$ and $g_t^{d}$ denote the encoder and decoder of the teacher ensemble, $\epsilon$ bounds the norm of the perturbation, and d(.) is the sum of the pixel-wise Kullback–Leibler (KL) divergence between the original and perturbed pixel predictions.
Figure 6 illustrates feature-level perturbation.
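The idea of estimating the adversarial feature noise with a teacher and applying it to the student can be sketched on a toy problem. The code below is a simplified illustration and not the authors' implementation: the "decoder" is a linear map followed by softmax acting on one feature vector, the gradient of the KL divergence is obtained by finite differences rather than backpropagation, and only a single refinement step of the usual VAT approximation is performed. All function names and the values of `xi`, `h`, and `eps` are assumptions.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(weights: Matrix, z: Sequence[float]) -> Vector:
    """Toy linear 'decoder': class probabilities from a feature vector."""
    return softmax([sum(w * zi for w, zi in zip(row, z)) for row in weights])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def l2_normalise(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def teacher_vat_noise(teacher_w: Matrix, z: Sequence[float], eps: float,
                      xi: float = 1e-2, h: float = 1e-4, seed: int = 0) -> Vector:
    """One-step VAT estimate of the direction that most changes the teacher output."""
    rng = random.Random(seed)
    d = l2_normalise([rng.gauss(0.0, 1.0) for _ in z])
    clean = decode(teacher_w, z)

    def divergence(r: Sequence[float]) -> float:
        return kl_divergence(clean, decode(teacher_w, [zi + ri for zi, ri in zip(z, r)]))

    r0 = [xi * di for di in d]          # small random probe
    base = divergence(r0)
    grad = []
    for i in range(len(z)):             # finite-difference gradient at the probe
        r = list(r0)
        r[i] += h
        grad.append((divergence(r) - base) / h)
    return [eps * g for g in l2_normalise(grad)]


def perturbed_student_prediction(student_w: Matrix, teacher_w: Matrix,
                                 z: Sequence[float], eps: float = 0.5) -> Vector:
    """Apply the teacher-estimated noise to the student feature, then decode."""
    r_adv = teacher_vat_noise(teacher_w, z, eps)
    return decode(student_w, [zi + ri for zi, ri in zip(z, r_adv)])
```

In a real network the same steps would operate on encoder feature maps and sum the divergence over pixels; the sketch keeps only the structure (teacher estimates the noise, student consumes it).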
3.4.3. Model-Level Perturbation
To enhance the perturbations of the self-training method, we introduced model-level perturbations through auxiliary teachers. In the unsupervised stage, we utilised the teacher model, which received parameters from the student model with EMA, to predict the input images. Based on the assumption of consistency regularisation, multiple models trained at different stages should have similar predictions for the same image. By measuring the differences between the labels generated by teachers and the student model using a loss function, we provided feedback to the student model and updated its parameters, thus improving the generalisation performance of the model.
It is worth noting that, based on the idea of gradually strengthening model learning, we used weak data augmentation methods and feature-level VAT perturbation instead of directly incorporating strong data augmentation methods into the supervised learning. We found from experiments that training the model from weak to strong effectively improved the performance of the supervised learning as well. Using strong data augmentation throughout all epochs may lead to a decrease in the performance of the model. However, if strong data augmentation is selectively applied in the latter half of the epochs, there is potential for improvement in the performance. Since VAT perturbation searches for pixels with more noise in the current features, it is related to the performance and is considered a weak-to-strong data augmentation method. Meanwhile, based on the idea of progressive learning, we gradually shifted the focus of the loss function towards the labels generated by the teacher model by dynamically adjusting its attention. This adjustment allowed the loss function to increasingly prioritise the labels provided by the teacher model as the training progressed, leading to more reasonable model learning.
3.5. Ensemble Voting: Pseudo-Label Generation Strategy (EV)
In a model based on the self-training paradigm, the assignment of pseudo-labels to unlabelled data using a trained teacher model plays a crucial role. However, past self-training methods have faced limitations when manually setting confidence thresholds and filtering the softmax probability results from the model. Although this approach has contributed to some improvement in the confidence of the model, it suffers from issues such as the inefficient utilisation of all pixels, heavy reliance on artificially set hyperparameters, and the presence of fuzzy segmentation boundaries.
To address these challenges and enhance the model predictions, we integrated TTA technology. TTA leverages multiple scales of images, and we uniformly resized them using bilinear interpolation to ensure consistent dimensions for predictions. In our experiments, we explored different scaling factors, including 0.5, 0.75, 1.0, 1.5, and 2.0, and weighted the predictions obtained at each scale. Furthermore, we horizontally flipped the images at each scale to augment the ability of the model to recognise objects in the image. By aggregating the predictions from all scales and applying softmax, we obtained the final prediction.
Additionally, we introduce auxiliary teachers to further support the decision-making process of the model. Following the ensemble learning vote concept, we utilise TTA technology to predict labels using the student model. Subsequently, we combine the TTA-augmented predicted labels from the auxiliary teachers and the student model, assigning appropriate weights to achieve more reliable and smoother labelling results.
Overall, our labelling method, outlined in Algorithm 2, effectively leverages TTA to enhance the model predictions and achieve improved performance in forest cover mapping. The incorporation of TTA for both auxiliary teachers and the student model contributes to better decision making and yields superior segmentation results.
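Algorithm 2 Labelling
Require: Unlabelled training set $\mathcal{D}_u$; Test-Time Augmentation (TTA); teacher/auxiliary teacher models
Ensure: Pseudo-labels for $\mathcal{D}_u$
1: for each minibatch of $\mathcal{D}_u$ do
2:   for each image $k$ in the minibatch do
3:     predict with the teacher under TTA (multiple scales and horizontal flip)
4:     predict with the first auxiliary teacher under TTA
5:     predict with the second auxiliary teacher under TTA
6:     fuse the TTA predictions of all models by weighted voting
7:     assign the fused prediction to image $k$ as its pseudo-label
8:   end for
9: end for
10: return the pseudo-labels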
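The pseudo-label generation described in this section (TTA per model, then weighted voting across models) can be sketched as follows. This is a reduced illustration under stated assumptions: only the horizontal-flip part of TTA is implemented (the multi-scale resizing with bilinear interpolation is omitted for brevity), a "model" is any callable returning a per-pixel forest probability map, and the weights and the 0.5 decision threshold are placeholders.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]
ProbMap = List[List[float]]          # per-pixel probability of the forest class
Model = Callable[[Image], ProbMap]


def hflip(grid: List[List[float]]) -> List[List[float]]:
    return [row[::-1] for row in grid]


def tta_predict(model: Model, img: Image) -> ProbMap:
    """Average the prediction on the image and on its flipped copy (flipped back)."""
    direct = model(img)
    restored = hflip(model(hflip(img)))
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(direct, restored)]


def ensemble_vote(models: Sequence[Model], weights: Sequence[float],
                  img: Image, threshold: float = 0.5) -> List[List[int]]:
    """Weighted soft vote over the TTA predictions of several models."""
    total = sum(weights)
    height, width = len(img), len(img[0])
    fused = [[0.0] * width for _ in range(height)]
    for model, weight in zip(models, weights):
        probs = tta_predict(model, img)
        for r in range(height):
            for c in range(width):
                fused[r][c] += weight * probs[r][c] / total
    return [[1 if p > threshold else 0 for p in row] for row in fused]
```

Because every pixel receives a label from the fused probabilities, no confidence threshold is needed to discard pixels, which matches the motivation given above for replacing threshold-based filtering.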
3.6. Adaptive Loss (AL)
To address the issue of dataset imbalance, we trained the model using a combination of cross-entropy loss and focal loss. Cross-entropy is given by
$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(p_c),$$
where $y$ is a one-hot vector with length $C$, representing the true class; $y_c$ is the $c$th element in the vector $y$; $p$ is the predicted probability distribution vector of the model; and $p_c$ represents the probability that the model predicts the $c$th class. Focal loss is given by
$$\mathcal{L}_{focal} = -\sum_{c=1}^{C} y_c\,(1 - p_c)^{\gamma} \log(p_c),$$
where $y$, $y_c$, $p$, and $p_c$ are defined as for the cross-entropy loss, and $\gamma$ is a hyperparameter called the focusing parameter, which is used to adjust the degree of attention that the loss function pays to the predicted probability of different classes. When $\gamma > 0$, focal loss pays more attention to the mispredicted samples, thus reducing the problem of class imbalance.
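The effect of the focusing parameter can be checked numerically with a small sketch of the two losses for a single pixel. The function names and the default focusing value of 2.0 are assumptions for illustration; the clamp `eps` only guards against taking the logarithm of zero.

```python
import math
from typing import Sequence


def cross_entropy(y: Sequence[float], p: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy between a one-hot target y and predicted probabilities p."""
    return -sum(yc * math.log(max(pc, eps)) for yc, pc in zip(y, p))


def focal_loss(y: Sequence[float], p: Sequence[float],
               gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_c) ** gamma for the true class."""
    return -sum(yc * (1.0 - pc) ** gamma * math.log(max(pc, eps))
                for yc, pc in zip(y, p))
```

For a target `[1, 0]`, a confident prediction `[0.9, 0.1]` keeps only 1% of its cross-entropy under `gamma=2`, whereas a poor prediction `[0.3, 0.7]` keeps 49%, so hard pixels dominate the gradient. With `gamma=0` the focal loss reduces to the cross-entropy.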
It is worth noting that focal loss was originally designed to solve the problem of extremely imbalanced positive and negative samples, as well as samples that are difficult to classify. Given that the focal loss adjusts the ratio of positive and negative sample loss adaptively based on the difficulty of the dataset, we did not remove focal loss when conducting experiments on balanced datasets. At the same time, to minimise problems caused by the manual setting of hyperparameters, we introduced an automatic coefficient that balances the loss terms computed from three quantities: the output of the auxiliary teachers, the pseudo-labels generated by the previous-stage model, and the predicted results of the student model. As training progresses, the reliability of the pseudo-labels generated during the initial training phase decreases, and the predictions from the auxiliary teachers become more accurate. Therefore, the loss for the original labelled data gradually decreases as the training progresses, while the loss for the predictions between the auxiliary teachers and the student model gradually increases. At the same time, the loss between the predicted results of the model and the pseudo-labels generated by the previous stage of the model and the auxiliary teachers can also be seen as a consistency regularisation method. Currently, we have only tried linear transformations and have not explored whether there are better adaptive adjustment coefficients.
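One way to picture a linearly adjusted coefficient of this kind is sketched below. This is a hypothetical schedule written for illustration, not the formula used in the paper: it assumes the weight on the pseudo-label loss falls linearly from 1 to 0 over training while the weight on the teacher-consistency loss rises from 0 to 1. The function names and the exact form of `progress` are assumptions.

```python
from typing import Tuple


def adaptive_weights(epoch: int, total_epochs: int) -> Tuple[float, float]:
    """Hypothetical linear schedule: (pseudo-label weight, teacher weight)."""
    progress = min(1.0, max(0.0, epoch / max(1, total_epochs - 1)))
    return 1.0 - progress, progress


def adaptive_loss(loss_vs_pseudo: float, loss_vs_teacher: float,
                  epoch: int, total_epochs: int) -> float:
    """Blend the two loss terms with the epoch-dependent weights."""
    w_pseudo, w_teacher = adaptive_weights(epoch, total_epochs)
    return w_pseudo * loss_vs_pseudo + w_teacher * loss_vs_teacher
```

Under this assumption no weight has to be tuned by hand: the only input is the training progress, and the two weights always sum to one.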
4. Data and Experiments
4.1. Data Description
In evaluating the performance of forest cover mapping, we utilised two datasets sourced from the SentinelHub satellite image database. The details regarding the number of images and the distribution between forest and non-forest categories are presented in Table 1. Both datasets consist of four-band data, with one originating from the Amazon Rainforest and the other from the Atlantic Forest (Mata Atlantica) [33]. The geographical distribution of these biomes can be observed in Figure 7, accompanied by sample images highlighting the dataset concentration in two distinct regions.
To streamline the training process, we adopted an approach similar to a related work [34] by training solely on the RGB channels extracted from the four-band dataset. Notably, the model demonstrated favourable performance. Each image in the datasets has dimensions of (512, 512, 3), while each forest cover mapping mask is represented by (512, 512, 1).
These datasets provide insights into two real-world scenarios: one involving an imbalanced class distribution and complex images (Atlantic Forest), posing a challenging learning task, and the other involving a balanced class distribution and relatively simple images (Amazon Forest).
4.2. Experimental Settings
To ensure a fair comparison with most existing works, we maintained consistent hyperparameters between the supervised pre-training of the teacher models and the semi-supervised re-training of the student model. Specifically, we set the batch size to 8 during training with a V100-SXM2-32GB GPU. For optimisation, we used the SGD optimiser with an initial base learning rate of 0.001 for the backbones. The learning rate of the randomly initialised segmentation head was 10 times larger than that of the backbones. Additionally, we adopted poly scheduling to decay the learning rate during the training process, i.e.,
$$lr = lr_{base} \times \left(1 - \frac{iter}{total\_iter}\right)^{power},$$
where $iter$ is the current training iteration, $total\_iter$ is the total number of iterations, and $power$ is the decay exponent.
The model was trained for 80 epochs using weak data augmentations, which include the random flipping and resizing of training images between 0.5 and 2.0. For strong data augmentations on unlabelled images, we used colorJitter with the same intensity as in [31], greyscale, blur (same as in [35]), and cutout with random values filled. The cutout regions were ignored in loss computation. During the pseudo-labelling phase, all unlabelled images underwent TTA, which involved five scales and horizontal flipping. The testing images were evaluated at their original resolution, and no post-processing techniques were employed. It is worth noting that to enable a fair comparison with most existing works, we have not incorporated any advanced optimisation strategies, such as OHEM in [36], auxiliary supervision in [36,37], or SyncBN, into our method.
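The poly learning-rate decay and the tenfold learning rate of the segmentation head mentioned in the experimental settings can be sketched as follows. The exponent default of 0.9 is an assumption (a commonly used value that is not stated in the text), and the function and key names are hypothetical.

```python
from typing import Dict


def poly_lr(base_lr: float, iteration: int, total_iters: int, power: float = 0.9) -> float:
    """Poly decay: base_lr * (1 - iteration / total_iters) ** power."""
    return base_lr * (1.0 - iteration / total_iters) ** power


def group_learning_rates(base_lr: float, iteration: int, total_iters: int,
                         head_multiplier: float = 10.0,
                         power: float = 0.9) -> Dict[str, float]:
    """Learning rates for the backbone and the randomly initialised head."""
    lr = poly_lr(base_lr, iteration, total_iters, power)
    return {"backbone": lr, "head": lr * head_multiplier}
```

With a base learning rate of 0.001, the schedule starts at 0.001 for the backbone and 0.01 for the head, and both decay to zero at the final iteration.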
4.3. Evaluation Metrics
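To evaluate the performance of our proposed method in forest cover mapping, we measured several evaluation metrics, including IoU, mean IoU (mIoU), mean precision, mean recall, mean F1-score, and accuracy for both forest and non-forest classes, i.e.,
$$IoU = \frac{TP}{TP + FP + FN}, \qquad mIoU = \frac{1}{c}\sum_{i=1}^{c} IoU_i,$$
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \qquad Accuracy = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP, FP, TN, and FN are the pixel counts of true positives, false positives, true negatives, and false negatives for a given class; $c$ represents the number of shared classes between the benchmark datasets; and the mean precision, mean recall, and mean F1-score are obtained by averaging the per-class values over the $c$ classes in the same way as mIoU.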
Figure 8 shows the class distribution of the two datasets used.
The evaluation metrics were computed using the confusion matrix generated by our semi-supervised segmentation framework, which contains the pixel numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Specifically, mIoU and IoU measure the similarity between predicted and ground-truth forest/non-forest areas, while accuracy measures the overall percentage of correctly classified pixels. The mean precision, mean recall, and mean F1-score consider both precision and recall, which are suitable for multi-class and pixel-level classification tasks. Our results demonstrate the superior performance of our proposed method in accurately mapping forest cover, as evidenced by the higher values of these evaluation metrics.
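The metrics derived from the confusion matrix can be computed with a few lines of code. The sketch below uses the standard definitions for a two-class (forest/non-forest) problem; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here for illustration.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Per-class metrics from confusion-matrix pixel counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2.0 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


def mean_iou(tp: int, fp: int, tn: int, fn: int) -> float:
    """mIoU over the forest and non-forest classes.

    The non-forest class swaps the roles of positives and negatives.
    """
    forest = binary_metrics(tp, fp, tn, fn)["iou"]
    non_forest = binary_metrics(tn, fn, tp, fp)["iou"]
    return (forest + non_forest) / 2.0
```

For instance, with 40 true-positive, 10 false-positive, 45 true-negative, and 5 false-negative pixels, the forest IoU is 40/55, the non-forest IoU is 45/60, and the accuracy is 0.85.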
4.4. Results and Analysis
Based on the aforementioned forest RS datasets, we compared our proposed method with several SOTA semi-supervised semantic segmentation frameworks, including the feature-perturbation-based CCT [25], the multi-perturbation-based PS-MT [26], and baseline ST with TTA. We also included the supervised DeeplabV3+ [38] model using ResNet-101 [39] trained only with labelled data as a baseline. All semi-supervised methods were implemented with identical experimental conditions and settings to ensure fairness. The results demonstrate the effectiveness and superiority of our proposed method in accurately mapping forest cover. Additionally, our method has huge potential for practical applications in forest monitoring and management.
4.4.1. Comparison Results on the Atlantic Forest Dataset
Table 2 and Table 3 show comparative results with other SOTA methods on the validation and test sets of the Atlantic dataset. As introduced in Section 2, the CCT method based on image surface perturbation cannot effectively learn the information of satellite RS data, resulting in poor performance with inadequate fitting during the training process. Similarly, other semi-supervised models (e.g., PS-MT) also suffer from this problem and the instability caused by numerous hyperparameters, performing far worse than our proposed method on multiple metrics after our adjustment. Methods based on self-training and data augmentation using colour transformations exhibit stronger performance, and our method, which is an improvement on self-training, outperforms them. Meanwhile, compared with fully supervised methods, our approach also shows significant superiority. Figure 9 graphically shows the performance of mIoU for different models. Specifically, on the validation set, except for a slightly lower forest segmentation IoU compared to the ST method at the 1/16 split, our model outperforms the other SOTA methods in all metrics. Similarly, on the test set, our method surpasses the majority of the SOTA methods in most metrics. For the important metric mIoU in forest cover segmentation, our model demonstrates superior performance across different splits, as well as on the validation and test sets. This indicates that our model is capable of effectively learning information from satellite RS images and exhibits strong robustness.
We present partial visual comparison results of all the methods in Figure 10. On the Atlantic Forest dataset, due to the data imbalance and the difficulty of image learning, it can be observed that our proposed method produces smoother boundaries and highlights more details compared to the fully supervised methods. Moreover, our method yields predictions that are closer to the ground truth compared to other semi-supervised methods.
4.4.2. Comparison Results on the Amazon Forest Dataset
Table 4 and Table 5 present the comparison results on the Amazon Forest validation and test sets using our evaluation metrics of accuracy, mRecall, mPrecision, mF1, and mIoU. The tables show that on the datasets with class balance and lower image difficulty, the supervised Deeplabv3+ achieves better segmentation accuracy than the results obtained on the Atlantic Forest dataset. Similarly, the performance of PS-MT gradually improves, but due to the existence of manually adjusted semi-supervised loss ratio coefficients, the overall performance remains unstable. Additionally, the CCT method fails to learn image information due to the challenges presented by satellite RS datasets. Consistent with our expectations, the self-training method still exhibits strong stability and significantly improves performance across all evaluation metrics compared to fully supervised methods. However, as the amount of labelled data increases, the performance of the model decreases. Furthermore, the performance of the supervised method varies due to varying image difficulties, with its performance at the 1/8 partition being inferior to that achieved at the 1/16 partition. In contrast, our proposed Semi-FCMNet, which enhances the perturbation of the model, leads to performance improvements. Moreover, our method also outperforms the supervised method and other SOTA methods on the test set, indicating its strong generalisation ability. Figure 11 graphically shows the performance of mIoU for different models.
We also present partial visual comparison results of all methods on the Amazon Forest dataset in Figure 12. Due to the lower sample complexity of this dataset, most methods achieve good prediction results. However, it is worth noting that our proposed method generates accurate predictions at different partition ratios, as shown in the figure.
4.5. Ablation Experiments
Ablation studies were conducted to validate the effectiveness of each key component of our proposed method. Our method mainly consists of the following three core components: multi-level perturbation (MP), covering the input image representation level, feature level, and model level; AL; and a pseudo-label generation strategy based on TTA and multi-model voting (EV). We present specific metric data in Table 6 and Table 7, along with the visualisation results of the ablation experiments in Figure 13 and Figure 14.
Base: When employing the basic self-training method on the imbalanced Atlantic Forest dataset with high sample learning difficulty, the performance of the model at the 1/16 partition is slightly inferior to that of the fully supervised method. However, in other partitions, the semi-supervised method showcases its superiority by exhibiting better performance on the validation and test sets compared to the fully supervised method. This outcome fully demonstrates the advantages of the semi-supervised approach. However, on the Amazon Forest dataset with lower sample learning difficulty, the basic self-training method is prone to overfitting to the noise and cannot exceed the performance of the fully supervised method.
MP: After adding the MP method, the various indicators of the model are further improved, which proves that adding perturbations to the self-training paradigm, i.e., integrating the consistency regularisation method with the self-training method, effectively improves model performance.
EV: When the pseudo-label generation strategy based on TTA and multi-model voting (EV) is added, the scores of the model on important indicators, e.g., mIoU and mF1-score, are improved in various partitions of datasets with different data distributions. It can be seen that the method of generating pseudo-labels has a significant impact on the performance of the self-training paradigm, and our pseudo-label generation method effectively improves the performance of the model.
AL: After adding the adaptive loss, the model dynamically adjusts its focus on the loss. At the beginning of the training, the model pays more attention to the predicted results of the teacher model in the previous round. However, due to the limited performance of the teacher in the previous round, the current student model cannot be effectively improved in the later stage of training. The introduction of the adaptive loss makes the model focus more on the difference between its multiple prediction results in the later stage and improve its own performance through consistency regularisation. The improvement in various indicators reflects the correctness of our method.