A Multi-View Integrated Ensemble for the Background Discrimination of Semi-Supervised Semantic Segmentation

Abstract: The key to semi-supervised semantic segmentation is to assign appropriate pseudo-labels to the pixels of unlabeled images. Recently, various approaches to consistency-based training and the filtering of reliable pseudo-labels have shown remarkable results. Nonetheless, issues remain. We find that recent approaches share a specific problem: in the pseudo-labels used to train on unlabeled images, false foreground-class pseudo-labels are mostly caused by confusion with the background class, not by confusion between different foreground classes. To solve this problem, we propose a foreground and background discrimination model for semi-supervised semantic segmentation. The proposed model is trained using a novel approach called multi-view integrated ensemble (MVIE) via output perturbation. Experimental results under various partition protocols show that our approach outperforms the existing state of the art (SOTA) in binary prediction on unlabeled data, and that segmentation models trained with the help of our model outperform the original models.


Introduction
Semantic segmentation, which aims to achieve pixel-level classification in images, is a fundamental task in computer vision. It is widely applied in real-world applications, including robotics, AR, VR, autonomous driving, and disease diagnosis. With the rise of supervised deep neural networks such as convolutional neural networks (CNNs), semantic segmentation has shown impressive performance [1][2][3][4]. However, it still requires a large-scale labeled dataset for training [5,6]. When building such a dataset, tasks such as image classification require only one label per image, whereas segmentation requires pixel-wise labeling, which can lead to high costs. To alleviate this problem, studies on semi-supervised semantic segmentation have sought to effectively utilize small amounts of labeled data together with large amounts of unlabeled data. Recently, semi-supervised semantic segmentation studies have further demonstrated the potential of unlabeled data by showing results close to the performance of fully supervised semantic segmentation [7][8][9][10].
A common solution for semi-supervised semantic segmentation is to make predictions for unlabeled data using a model trained on labeled data, and then use those predictions as pseudo-labels to further train the model [11]. Because these pseudo-labels are predictions, they contain errors, which can have a negative effect on model training [12,13]. A typical way to overcome this problem is to filter by predicted confidence score, using only reliable pseudo-labels with high confidence and discarding unreliable ones with low confidence [14]. Current state-of-the-art (SOTA) semi-supervised semantic segmentation models are based on consistency regularization [9,15,16]. One example independently applies weak and strong perturbations to unlabeled images; predictions with high confidence scores on the weakly augmented images are then assigned as pseudo-labels to the strongly augmented images, forcing consistency of output [14]. Perturbations can be applied in a variety of ways. Input perturbations can be applied using methods such as CutOut [17] and CutMix [18], as well as classical image augmentation methods such as color jitter. Feature perturbations can be applied by injecting noise into the feature space. Network perturbations are applied by encouraging consistency in the predictions of multiple models trained from different initializations; this approach showed better results than input or feature perturbation approaches [15]. Consistency regularization is based on a smoothness assumption: if two points are close in the input space, their labels must be the same [19]. This improves generalization performance by encouraging the model to make stable predictions even under small perturbations [8]. However, this assumes perfect prediction for unlabeled data, which is difficult to achieve in practice, and it implies that perturbations do not push image features to the wrong side of the true decision boundary [19]. To reduce the risk of these assumptions, many previous studies have attempted to obtain high-quality pseudo-labels based on confidence scores.
Although significant results have been achieved through various pseudo-label-based approaches for unlabeled data, there has been little exploration of what specific problems pseudo-labels have. From this perspective, we found a problem that recent SOTA models have in common: despite their different approaches, pseudo-labels based on predictions for unlabeled data share common error characteristics. Figure 1a-c shows the normalized confusion matrices of pseudo-labels in SOTA semi-supervised semantic segmentation methods [7][8][9]. In a normalized confusion matrix, which is a square matrix, the main diagonal elements represent correct predictions, and the off-diagonal elements represent incorrect predictions for each class. As can be seen from all three matrices, the false pseudo-labels in most of the foreground classes are mostly confused with the background class (class 0), not with the other foreground classes. We believe that this result is due to a lack of information and predictive certainty in semi-supervised semantic segmentation, considering that all areas except the foregrounds of interest should be assigned to the background. That is, unlike each foreground class, whose instances may share common image information, the background class may have a complex and diverse image structure, which can cause further confusion. This problem is even more important in many real-world applications, which aim to classify all areas other than a few foregrounds of interest as background and must predict well on unseen data. To alleviate this problem, we propose a new approach called Multi-View Integrated Ensemble (MVIE), which can better distinguish between the background and foreground in semi-supervised semantic segmentation. MVIE is a novel ensemble approach based on output perturbation, and is described in detail in Section 3.
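The kind of analysis behind Figure 1 can be illustrated with a minimal sketch. It assumes flattened lists of true and pseudo-label class indices, with rows of the matrix indexed by the true class; the function names and the toy values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def normalized_confusion(true_labels, pred_labels, num_classes):
    """Row-normalized confusion matrix: row = true class, column = predicted class."""
    counts = Counter(zip(true_labels, pred_labels))
    matrix = []
    for t in range(num_classes):
        row_total = sum(counts[(t, p)] for p in range(num_classes))
        matrix.append([counts[(t, p)] / row_total if row_total else 0.0
                       for p in range(num_classes)])
    return matrix


def background_error_share(true_labels, pred_labels, background=0):
    """Among wrongly labeled foreground pixels, the fraction predicted as background."""
    errors = [p for t, p in zip(true_labels, pred_labels)
              if t != background and p != t]
    return sum(p == background for p in errors) / len(errors) if errors else 0.0


# Toy example with 3 classes (0 = background); the values are made up.
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 0, 0, 2, 0, 1, 2]
cm = normalized_confusion(y_true, y_pred, 3)
share = background_error_share(y_true, y_pred)
print(cm)     # one row per true class
print(share)  # share of foreground errors that fall on the background class
```

A large value of `share` corresponds to the observation above: most false pseudo-labels of foreground pixels are background predictions rather than other foreground classes.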
We evaluated the proposed MVIE under various training settings on PASCAL VOC 2012 [20], where many different kinds of objects are assigned to the background. Our experimental results on unlabeled data show that the proposed approach discriminates background from foreground better than recent SOTA semi-supervised semantic segmentation models. Furthermore, we used our MVIE model to first classify pseudo-labels as background or foreground, and then trained SOTA semi-supervised semantic segmentation models using pseudo-labels constrained by that discrimination. All experimental results combining our MVIE model with a recent SOTA model show better performance than the corresponding SOTA model alone. Specifically, our contributions include the following:

• We find a common problem with pseudo-labels in SOTA semi-supervised semantic segmentation models: false foreground pseudo-labels arise mainly from confusion with the background class rather than from confusion among foreground classes.
• To alleviate this problem, we propose a novel ensemble approach based on output perturbation. Our method outperforms existing SOTA models in background and foreground classification performance on unlabeled data.
• When an existing SOTA model is trained with the help of our model, the computational cost of training increases, but inference incurs the same cost as the original model. Therefore, from a practical usage perspective, improved performance can be achieved without increasing the computational cost of inference.

Related Work
In this section, we review approaches to semi-supervised learning and semi-supervised semantic segmentation-related works.

Semi-Supervised Learning
The goal of semi-supervised learning is to improve model accuracy by learning effectively not only from labeled data but also from unlabeled data. Two representative approaches to this problem are entropy minimization and consistency regularization [14].

Entropy Minimization
Entropy minimization aims to minimize the predictive uncertainty of a model for unlabeled data. The entropy of predictions, calculated as the negative sum of the product of predictive probabilities and log predictive probabilities, is often used as a measure of uncertainty in model predictions. Minimizing entropy encourages the model to make low-entropy predictions for unlabeled data. Recently, a more intuitive and effective framework, the entropy minimization of self-training [21][22][23], has shown its effectiveness by assigning pseudo-labels to unlabeled data and then retraining on them in combination with labeled data. Within this training method, the quality of pseudo-labels can be an important factor in the effectiveness of entropy minimization. For this reason, many recent studies have used predictive probabilities as indicators to select and train on more accurate pseudo-labels [7,12,14,24].
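As a small illustration of the entropy measure described above, the following sketch computes the entropy of a softmax prediction for a single pixel. The logits are made-up values, and the small `eps` term is an assumption added only for numerical safety.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """H(p) = -sum_c p_c * log(p_c); eps guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)


confident = softmax([8.0, 0.0, 0.0])   # one class clearly dominates
uncertain = softmax([0.1, 0.0, 0.0])   # nearly uniform prediction
print(entropy(confident), entropy(uncertain))
```

The confident prediction has the lower entropy, which is the behavior that entropy minimization encourages on unlabeled data.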

Consistency Regularization
Consistency regularization encourages consistency in the outputs for inputs that have been perturbed in various ways, allowing decision boundaries to be placed in low-density regions. Inputs can be perturbed in various ways; FixMatch [14], for example, encourages prediction consistency by perturbing inputs with data-augmentation-style transformations. CutMix [18] perturbs the input by replacing part of an image with part of another image, and is used in many high-performance approaches.
FixMatch supervises predictions on strongly perturbed unlabeled data using predictions on weakly perturbed versions of the same data. This method, which essentially combines entropy minimization and consistency regularization, experimentally demonstrated the potential of semi-supervised learning.
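The weak-to-strong supervision described above can be sketched per pixel as follows. This is a simplified illustration, not the FixMatch implementation: the `IGNORE` marker, the threshold of 0.95, and the choice to average the loss over retained pixels only are assumptions of this sketch, and a real implementation would operate on tensors.

```python
import math

IGNORE = 255  # hypothetical marker for pixels excluded from the unsupervised loss


def pseudo_labels_from_weak_view(weak_probs, threshold=0.95):
    """Keep the argmax class where the weak-view confidence reaches the threshold."""
    labels = []
    for probs in weak_probs:
        conf = max(probs)
        labels.append(probs.index(conf) if conf >= threshold else IGNORE)
    return labels


def masked_cross_entropy(strong_probs, labels, eps=1e-12):
    """Cross-entropy of strong-view predictions against the retained pseudo-labels."""
    kept = [(p, y) for p, y in zip(strong_probs, labels) if y != IGNORE]
    if not kept:
        return 0.0
    return -sum(math.log(p[y] + eps) for p, y in kept) / len(kept)


# Three pixels, three classes; the probabilities are made up.
weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20], [0.01, 0.01, 0.98]]
strong = [[0.50, 0.25, 0.25], [0.20, 0.70, 0.10], [0.25, 0.25, 0.50]]
labels = pseudo_labels_from_weak_view(weak)
loss = masked_cross_entropy(strong, labels)
print(labels, loss)
```

The second pixel is dropped because its weak-view confidence is too low, so only confident predictions supervise the strongly perturbed view.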

Semi-Supervised Salient Object Detection
Salient object detection (SOD) is a crucial computer vision task aimed at precisely identifying and segmenting distinctive regions within an image, using methods that closely mimic the way humans perceive visually unique information. This task is partly related to ensuring good performance in background and foreground segmentation, which is a key aspect of our research. In recent years, SOD has attracted attention because salient image regions can be applied to modern computer vision tasks such as object recognition, visual tracking, and image segmentation. In light of this attention, state-of-the-art fully supervised SOD models have achieved remarkable performance, relying on a large amount of pixel-wise labeled data. However, obtaining such a fully labeled dataset is expensive and time-consuming. Therefore, recent developments have focused on semi-supervised SOD models that overcome the lack of labeled data as well as challenges in SOD such as object size variety, object invisibility, and cluttered backgrounds.
LFCS [25] employs semi-supervised learning to distinguish unlabeled regions by leveraging a substantial amount of unlabeled data alongside labeled data to enhance classifier performance. LFCS utilizes linear feedback control theory as a mathematical foundation for formulating semi-supervised classifiers. EBM [26] is a latent variable model for semi-supervised SOD, conceptualized as a problem of learning pseudo-label confidence. EBM also incorporates a non-Gaussian prior distribution through an energy-based model for the latent variable. The exploration of an informative latent space enhances confidence estimation accuracy, facilitating the effective utilization of unlabeled training data. ASOD [27] is an active learning framework for semi-supervised SOD, designed to optimize network performance with minimal annotation costs. ASOD introduces adversarial learning and unsupervised feature representation through a Variational Autoencoder (VAE) to identify discriminative and representative samples for addition to the labeled pool.

Supervised Semantic Segmentation
HRNet [28] connects high-to-low convolution streams in parallel, ensuring the maintenance of high-resolution representations throughout the process. It achieves reliable high-resolution representations with strong position sensitivity by iteratively fusing representations from multi-resolution streams. This enables HRNet, as a stronger backbone, to achieve superior results on a wide range of visual recognition problems, including semantic segmentation. Wang et al. [29] introduced a supervised, pixel-wise contrastive learning approach for semantic segmentation, transitioning from the current image-wise training strategy to an inter-image, pixel-to-pixel paradigm. This design enables access to more representative data samples and facilitates the exploration of structural relations between pixels and semantic-level segments, emphasizing proximity in the embedding space for pixels and segments of the same class. Zhou et al. [30] introduced a novel approach to semantic segmentation by abstracting each class through a set of prototypes that effectively capture class-wise characteristics and intra-class variance. The interpretability of the model is enhanced because the prediction for each pixel can be intuitively understood as a reference to its closest class center in the embedding space.

Semi-Supervised Semantic Segmentation
Recent studies on semi-supervised semantic segmentation have shown excellent results that are close to the performance of fully supervised semantic segmentation models by applying consistency-based methodologies in various ways. CCT [15] enforces agreement between the results obtained from features perturbed in various ways and the results obtained from unperturbed features. CPS [9] uses network consistency by forcing consistency on the outputs of two models starting from different initializations. PS-MT [8] uses a confidence-weighted cross-entropy loss, which multiplies the cross-entropy loss by the segmentation prediction confidence when calculating the unsupervised loss on unlabeled data. PS-MT additionally enforces consistency for predictions using network perturbation by two teacher models, input perturbation using CutMix with weak and strong augmentation, and feature perturbation using virtual adversarial training (VAT). U$^2$PL [7] uses reliable pseudo-labels by filtering based on the entropy of the probability distribution of each pixel. U$^2$PL additionally makes use of most pixels by pushing unreliable pseudo-labels into a queue of negative samples and using them in a contrastive loss, which serves as an unsupervised loss. UniMatch [31] revisits FixMatch, a semi-supervised image classification study, for semi-supervised semantic segmentation. Interestingly, when FixMatch, with its simple pipeline, is transferred to the semi-supervised semantic segmentation scenario, it shows results competitive with SOTA studies. However, because FixMatch relies heavily on manually designed strong augmentations, UniMatch proposes a Unified Dual-Stream Perturbations approach to mitigate this issue. As a result, the method reports experimentally improved performance by expanding the perturbation space. S4MC [32] proposes a novel confidence refinement scheme to improve the quality of pseudo-labels for semantic segmentation. Unlike common solutions that do not use pseudo-labels for low-confidence predictions, S4MC leverages the spatial correlation of labels in segmentation maps by grouping adjacent pixels and considering their pseudo-labels collectively. In this way, S4MC maintains the quality of pseudo-labels while increasing the number of pseudo-labels used during training. Several SOTA studies have thus shown semi-supervised semantic segmentation results that can be used in practice by applying various consistency and pseudo-label filtering methods.
However, as discussed in Section 1, technical problems such as overfitting and the reliability of pseudo-labels have been studied so far, but little has been written about the specific problems of pseudo-labels. We found that a common cause of mispredicted pseudo-labels in recent SOTA studies is not confusion between different foreground classes, but mostly confusion between each foreground class and the background class. Therefore, we propose an improved background and foreground binary segmentation model for semi-supervised semantic segmentation. Our approach uses consistency training based on input perturbations and new output perturbations.

Proposed Method
In this section, we mathematically describe our problem, architecture, and training process. Section 3.1 first gives an overview of the proposed method, which consists of multiple multi-view teacher networks and a student network. Our strategy for reliable pseudo-label filtering, which is achieved by applying a new ensemble technique called MVIE to the predictions of multi-view teachers, is described in Section 3.2, along with the model architecture of the multi-view teachers. Finally, Section 3.3 introduces the student's pseudo-labels, generated through the ensemble of all teachers, along with the student model architecture.

Overview
Semi-supervised semantic segmentation aims to efficiently utilize the information of both unlabeled data and labeled data. We therefore have a small amount of labeled data, consisting of images $x_i^l$ with labels $y_i^l$, and a large amount of unlabeled data, consisting of images $x_i^u$, where $x_i^l, x_i^u \in \mathbb{R}^{H \times W \times 3}$ and $y_i^l \in \mathbb{R}^{H \times W \times C}$; $H$ and $W$ are the height and width of the image and $C$ is the number of classes. This dataset is used to train the student and all teachers. Additionally, $y_i^l$ is converted and re-defined for the student and the teachers. Figure 2 shows the transformation of the semantic ground truth used to train all our models when the number of classes is 6. First, we construct $C-1$ teacher networks and one student network. The student and the teachers output $\hat{y}^s \in \{0, 1\}^{H \times W}$ and $\hat{y}^t \in \{0, 1, 2\}^{H \times W}$, respectively, so the labels for the student and the teachers have two and three classes, respectively. That is, we re-define the semantic ground truth for the teachers and the student, pixel by pixel on the class index, using Equations (1) and (2):
$$y_i^{l(n)} = \begin{cases} 0, & y_i^l = 0 \\ 1, & y_i^l = n \\ 2, & \text{otherwise} \end{cases} \tag{1}$$
$$y_i^{l(s)} = \begin{cases} 0, & y_i^l = 0 \\ 1, & \text{otherwise} \end{cases} \tag{2}$$
where $y_i^l$ is the i-th semantic ground truth, and $y_i^{l(n)}$ and $y_i^{l(s)}$ are the re-defined ground truths for the n-th teacher and the student, respectively. The student is trained based on input perturbation, and the teachers are trained based on the new output perturbation.
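The label re-definition for the teachers and the student can be sketched as follows. The function names are hypothetical, and passing the value 255 through unchanged as an ignore label is an assumption of this sketch (PASCAL VOC marks void pixels with 255); the paper does not state how such pixels are handled.

```python
IGNORE = 255  # assumed void label, left unchanged by the conversion


def teacher_label(semantic_label, n):
    """Three-class label for teacher n: 0 = background, 1 = class n, 2 = any other class."""
    if semantic_label == IGNORE:
        return IGNORE
    if semantic_label == 0:
        return 0
    return 1 if semantic_label == n else 2


def student_label(semantic_label):
    """Binary label for the student: 0 = background, 1 = any foreground class."""
    if semantic_label == IGNORE:
        return IGNORE
    return 0 if semantic_label == 0 else 1


def convert_mask(mask, fn):
    """Apply a per-pixel conversion to a 2D mask given as nested lists."""
    return [[fn(v) for v in row] for row in mask]


# Toy 2x3 mask with 6 semantic classes (0 to 5), mirroring the setting of Figure 2.
mask = [[0, 1, 3],
        [5, 0, 2]]
teacher3 = convert_mask(mask, lambda v: teacher_label(v, 3))
student = convert_mask(mask, student_label)
print(teacher3)
print(student)
```

Each teacher therefore sees the same image under a different three-class view, while the student sees only the background and foreground split.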
Figure 3 shows an overview of MVIE. In the MVIE architecture, each model named teacher consists of a CNN-based encoder $h$ and a decoder $g$ with a segmentation head. The teacher model is decomposed into an encoder $h_{\theta^t_h}: \mathcal{X} \to \mathcal{Z}$ and a decoder $g_{\theta^t_g}: \mathcal{Z} \to \mathcal{Y}$, where $\mathcal{Z} \subset \mathbb{R}^Z$ represents the feature space of dimension $Z$. There are $C-1$ teacher models, indexed by $n = 1, \dots, C-1$, where $C$ is the number of classes, and all of them have the same structure. The student consists of a single model with only a decoder $g_{\theta^s_g}: \mathcal{Z} \to \mathcal{Y}$, which operates on the $Z$-dimensional features encoded by a fixed teacher. That is, one of the teacher models, named the fixed teacher, is responsible for encoding the input and passing the features to the student. All teacher models and the student model have different initial weights. For all labeled images, the goal of the student and all teachers is to minimize the standard cross-entropy losses of Equations (4) and (7). For unlabeled images, each teacher model receives an image with weak augmentation and outputs $\hat{y}^t \in \{0, 1, 2\}^{H \times W}$. Each teacher then obtains pseudo-labels through a new ensemble method that uses the predictions of all other teachers, and computes the teacher's unsupervised loss in Equation (6). This part is described in detail in Section 3.2. The student decodes the features that the fixed teacher encodes from the strongly augmented image and outputs $\hat{y}^s \in \{0, 1\}^{H \times W}$; the result of applying a hard voting ensemble to the predictions of all teachers is used as the pseudo-label to calculate the student's unsupervised loss in Equation (9). This part is described in detail in Section 3.3.
The optimization target for the teachers and the student is to minimize their overall losses, which can be formulated as follows:
$$L_T = L^t_{sup} + \lambda_t L^t_{unsup}, \qquad L_S = L^s_{sup} + \lambda_s L^s_{unsup} \tag{3}$$
where $L_T$ and $L_S$ are the overall losses of the teacher and the student, $L^t_{sup}$ and $L^s_{sup}$ are the teacher and student supervised segmentation losses calculated from labeled data, and $L^t_{unsup}$ and $L^s_{unsup}$ are the teacher and student unsupervised segmentation losses calculated from unlabeled data, respectively. $\lambda_t$ and $\lambda_s$ are the weights of the teachers' and the student's unsupervised segmentation losses, respectively. In summary, the semantic segmentation labels are converted into a different teacher label for each teacher using Equation (1), and the teacher's supervised segmentation loss $L^t_{sup}$ is then calculated using the standard cross-entropy loss function. The unsupervised segmentation loss of each teacher, $L^t_{unsup}$, is calculated using the standard cross-entropy loss function on that teacher's pseudo-labels, which are generated using our new MVIE method. The student's supervised segmentation loss $L^s_{sup}$ is calculated using the standard cross-entropy loss function on the student labels, which are obtained by converting the semantic segmentation labels using Equation (2). Finally, the student's unsupervised segmentation loss $L^s_{unsup}$ is calculated using the standard cross-entropy loss function on the student's pseudo-labels, which are generated via a hard voting ensemble of all teachers.

Teacher Model Using a Multi-View Integrated Ensemble to Generate Pseudo-Labels
For training the teachers on unlabeled data, we propose MVIE, a novel ensemble technique for reliable pseudo-label filtering. MVIE applies a new output perturbation method: it re-defines the semantic classes as three classes, whose meaning differs for each teacher. We call a teacher with this new output perturbation a multi-view teacher. Figure 4 shows an example in which the multi-view teacher integrated ensemble is applied to the first teacher when the number of classes is 6. Because the number of classes is 6, there are 5 teachers (C-1), and each table entry lists the actual semantic classes covered by that teacher's class. For the first teacher, class 0 means the actual semantic class 0, class 1 means semantic class 1 (the same as the teacher number), and class 2 means semantic classes 2 to 5, that is, all other semantic classes. The pseudo-labels of a specific teacher are defined as follows: class 0 is the area where the class 0 predictions of all other teachers overlap, class 1 is the area where the class 2 predictions of all other teachers overlap, and class 2 is the area covered by the class 1 predictions of the other teachers without overlap.
As introduced in Section 3.1, there are as many teacher models with an encoder-decoder structure as the number of classes minus 1, i.e., teacher-1, teacher-2, teacher-3, ..., teacher-(C-1). Each teacher outputs $\hat{y}^t \in \{0, 1, 2\}^{H \times W}$. A value of 0 in a teacher's predicted output represents the background class of the semantic ground truth (generally class 0), 1 represents the semantic ground truth class corresponding to the teacher's number, and 2 represents all remaining semantic ground truth classes. For example, in the PASCAL VOC 2012 dataset, whose ground truth semantic classes range from 0 to 20, teacher-3's predicted output 0 indicates background class 0, 1 indicates semantic class 3, and 2 indicates all classes except 0 and 3, that is, 1, 2, 4, 5, 6, ..., 20. In the teacher's overall loss $L_T$ introduced in Equation (3), the first term is the supervised segmentation loss $L^t_{sup}$, defined in Equation (4) based on the cross-entropy (CE) loss, where $\ell_{ce}$ is the cross-entropy loss function, $x_i^l$ and $y_i^{l(n)}$ represent the i-th labeled image and the corresponding n-th teacher's label, respectively, $h_{\theta^t_h}$ and $g_{\theta^t_g}$ represent the teacher's encoder and decoder, respectively, and $g \circ h$ is the composition of $h$ and $g$. The second term in Equation (3) is the unsupervised segmentation loss $L^t_{unsup}$ for the pseudo-labels made using the new ensemble technique MVIE. We use MVIE to keep only reliable pixel-level pseudo-labels and ignore unreliable ones, so unreliable pseudo-labels are not subject to supervision. The pseudo-labels of the n-th teacher made via MVIE for the i-th unlabeled image at pixel j are defined in Equation (5), where $C$ represents the number of classes. The unsupervised segmentation loss $L^t_{unsup}$ is defined in Equation (6), where $x_{ij}^u$ and $\hat{y}_{ij}^{(n)}$ represent the i-th unlabeled image and the corresponding pseudo-label at pixel j, respectively, and $A_w(\cdot)$ represents a weak augmentation function, such as image flipping, cropping, or scaling.
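The MVIE pseudo-label rule for one teacher can be sketched per pixel as follows. This is an interpretation of the verbal description, not the authors' code: "non-overlapping" is read here as "exactly one other teacher predicts class 1", every other combination is treated as unreliable, and the `IGNORE` marker and function names are hypothetical. A stricter variant could additionally require the remaining teachers to predict class 2.

```python
IGNORE = 255  # hypothetical marker for unreliable pixels excluded from supervision


def mvie_pixel(other_preds):
    """Pseudo-label for one pixel from the predictions (0, 1 or 2) of all other teachers."""
    if all(p == 0 for p in other_preds):
        return 0  # every other teacher sees background
    if all(p == 2 for p in other_preds):
        return 1  # every other teacher sees "foreground, but not my class"
    if other_preds.count(1) == 1:
        return 2  # exactly one other teacher claims the pixel as its own class
    return IGNORE


def mvie_pseudo_labels(all_preds, n):
    """all_preds[k] is the flattened prediction of teacher k+1; n is the 1-based target teacher."""
    others = [pred for k, pred in enumerate(all_preds, start=1) if k != n]
    num_pixels = len(others[0])
    return [mvie_pixel([pred[j] for pred in others]) for j in range(num_pixels)]


# Three teachers (C = 4) and four pixels; the predictions are made up.
preds = [
    [0, 1, 2, 0],  # teacher-1
    [0, 2, 2, 0],  # teacher-2
    [0, 2, 1, 2],  # teacher-3
]
for n in (1, 2, 3):
    print(n, mvie_pseudo_labels(preds, n))
```

Note that a teacher's own prediction never enters its pseudo-label, which is what makes the ensemble a view from the other teachers. In the last pixel the other teachers of teacher-1 disagree (background versus other foreground), so that pixel is ignored.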
Finally, each teacher is trained with a loss function that is the weighted sum of Equation (4), $L^t_{sup}$, based on ground truth labels, and Equation (6), $L^t_{unsup}$, based on the pseudo-labels made through MVIE in Equation (5).
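For reference, one plausible written form of Equations (4) to (6), consistent with the symbol definitions above, is sketched below. It is a reconstruction under assumptions rather than the original typesetting: the batch sizes $N_l$ and $N_u$, the set $\Omega_i$ of non-ignored pixels, the averaging, the teacher superscript $t_m$, and the reading of "non-overlapping" as "exactly one other teacher" are all introduced here for illustration.

```latex
% Hedged reconstruction; N_l, N_u, \Omega_i, t_m and the normalization are assumptions.
\begin{align}
L^{t}_{sup} &= \frac{1}{N_l} \sum_{i=1}^{N_l}
  \ell_{ce}\!\left( (g_{\theta^t_g} \circ h_{\theta^t_h})(x^{l}_{i}),\; y^{l(n)}_{i} \right) \tag{4} \\
\hat{y}^{(n)}_{ij} &=
  \begin{cases}
    0, & \hat{y}^{t_m}_{ij} = 0 \ \text{for all } m \in \{1, \dots, C-1\} \setminus \{n\} \\
    1, & \hat{y}^{t_m}_{ij} = 2 \ \text{for all } m \in \{1, \dots, C-1\} \setminus \{n\} \\
    2, & \hat{y}^{t_m}_{ij} = 1 \ \text{for exactly one } m \in \{1, \dots, C-1\} \setminus \{n\} \\
    \text{ignore}, & \text{otherwise}
  \end{cases} \tag{5} \\
L^{t}_{unsup} &= \frac{1}{N_u} \sum_{i=1}^{N_u} \frac{1}{|\Omega_i|} \sum_{j \in \Omega_i}
  \ell_{ce}\!\left( (g_{\theta^t_g} \circ h_{\theta^t_h})(A_w(x^{u}_{ij})),\; \hat{y}^{(n)}_{ij} \right) \tag{6}
\end{align}
```

Here $\hat{y}^{t_m}_{ij}$ denotes the prediction of teacher $m$ at pixel $j$ of the i-th unlabeled image, and pixels marked "ignore" are left out of $\Omega_i$.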

Student Model That Outputs a Binary Using an Ensemble of Multi-View Teachers
As introduced in Section 3.1, the student consists of a single model with only a decoder structure. The student produces the binary output $\hat{y}^s \in \{0, 1\}^{H \times W}$.
When the student is trained on unlabeled data, strong augmentation is applied as the input perturbation. Weak augmentations, such as image flipping, cropping, and scaling, are applied for the teacher models, while strong augmentations, such as Gaussian blur, randomized grayscale, and color jitter, are applied for the student model. One randomly selected teacher is designated as the fixed teacher, and this teacher passes the features obtained by encoding the input image to the student. In the overall loss $L_S$ of the student introduced in Equation (3), the first term, the supervised segmentation loss $L^s_{sup}$, is defined in Equation (7), where $x_i^l$ and $y_i^{l(s)}$ represent the i-th labeled image and the corresponding student's label, respectively, and $h_{\theta^{t(fixed)}_h}$ and $g_{\theta^s_g}$ represent the fixed teacher's encoder and the student's decoder, respectively. The student's pseudo-labels for the i-th unlabeled image at pixel j, based on the hard voting ensemble of all teachers, are defined in Equation (8), where $\mathbb{1}(\cdot)$ is the indicator function, $f_{hv}$ is the hard voting ensemble function, and $x_{ij}^u$ is the i-th unlabeled image at pixel j.
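A hard voting ensemble for the student's binary pseudo-labels can be sketched as follows. The text does not spell out the voting rule, so this sketch assumes that each teacher's three-class prediction is first reduced to background (0) or foreground (1 or 2) and that a simple majority decides, with ties going to the background; the function names are hypothetical.

```python
def hard_vote_pixel(teacher_preds):
    """Binary pseudo-label for one pixel by majority vote over all teachers."""
    fg_votes = sum(1 for p in teacher_preds if p != 0)  # classes 1 and 2 both count as foreground
    bg_votes = len(teacher_preds) - fg_votes
    return 1 if fg_votes > bg_votes else 0  # assumed tie-break: background


def student_pseudo_labels(all_preds):
    """all_preds[k] is the flattened three-class prediction of teacher k+1."""
    num_pixels = len(all_preds[0])
    return [hard_vote_pixel([pred[j] for pred in all_preds]) for j in range(num_pixels)]


# Three teachers and four pixels; the predictions are made up.
preds = [
    [0, 1, 2, 0],  # teacher-1
    [0, 2, 2, 0],  # teacher-2
    [0, 2, 1, 2],  # teacher-3
]
votes = student_pseudo_labels(preds)
print(votes)
```

Unlike the teacher-side rule, every pixel receives a label here, because the vote only has to decide between background and foreground.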
The second term of Equation (3), the unsupervised segmentation loss $L^s_{unsup}$, is defined in Equation (9), where $A_s(\cdot)$ represents a strong augmentation function.
Finally, the student is trained with the weighted sum of Equation (7), $L^s_{sup}$, based on ground truth labels, and Equation (9), $L^s_{unsup}$, based on the student's pseudo-labels $\hat{y}_{ij}^{(s)}$ made by the hard voting ensemble of all teachers.
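One plausible written form of Equations (7) to (9), consistent with the symbol definitions in this section, is sketched below. As with the teacher losses, this is a reconstruction under assumptions: the batch sizes $N_l$ and $N_u$, the averaging over the $H \times W$ pixels, the teacher superscript $t_n$, and the order in which the indicator and the hard vote are applied are introduced here for illustration.

```latex
% Hedged reconstruction; N_l, N_u, t_n, the normalization and the voting order are assumptions.
\begin{align}
L^{s}_{sup} &= \frac{1}{N_l} \sum_{i=1}^{N_l}
  \ell_{ce}\!\left( (g_{\theta^s_g} \circ h_{\theta^{t(fixed)}_h})(x^{l}_{i}),\; y^{l(s)}_{i} \right) \tag{7} \\
\hat{y}^{(s)}_{ij} &= f_{hv}\!\left( \left\{ \mathbb{1}\!\left( \hat{y}^{t_n}_{ij} \neq 0 \right) \right\}_{n=1}^{C-1} \right) \tag{8} \\
L^{s}_{unsup} &= \frac{1}{N_u} \sum_{i=1}^{N_u} \frac{1}{HW} \sum_{j=1}^{HW}
  \ell_{ce}\!\left( (g_{\theta^s_g} \circ h_{\theta^{t(fixed)}_h})(A_s(x^{u}_{ij})),\; \hat{y}^{(s)}_{ij} \right) \tag{9}
\end{align}
```

Here $\hat{y}^{t_n}_{ij}$ is the prediction of teacher $n$ on the weakly augmented image at pixel $j$, and $f_{hv}$ returns the majority value of its binary arguments.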

Experiments
This section introduces the evaluation metrics, data, model, and implementation details used in the proposed method. We experimented under different partition protocol settings to investigate the efficiency and effectiveness of the proposed method, and investigated its performance on the unlabeled dataset as well as on the validation dataset.

Dataset
The standard datasets for semi-supervised segmentation include PASCAL VOC 2012 and Cityscapes. However, while PASCAL VOC 2012 contains multiple object classes and a background class, Cityscapes does not have a background class, and all of its classes are object classes. Because our goal is to build a model that can discriminate well between background and foreground, we adopted the PASCAL VOC 2012 dataset, which has a background class. The standard semantic segmentation benchmark dataset, PASCAL VOC 2012, consists of 20 semantic classes of objects and 1 background class. The training and validation sets in this dataset consist of 1464 and 1449 images, respectively. We adopted the general practice of previous studies [7][8][9] of using 10,582 labeled images, augmented with additional data from [33], as the full training dataset. To simulate the semi-supervised setting, we subsampled 1/8, 1/4, and 1/2 of this full training dataset as labeled data and used the remainder as unlabeled data.
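The partition protocols can be illustrated with a small sketch that splits a list of image identifiers into labeled and unlabeled subsets. The identifiers, the seed, and the rounding (floor division) are assumptions of this sketch; published split files for these protocols may round or sample differently.

```python
import random


def split_labeled_unlabeled(image_ids, denominator, seed=0):
    """Randomly keep 1/denominator of the ids as labeled; the rest become unlabeled."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = len(ids) // denominator
    return ids[:n_labeled], ids[n_labeled:]


# Placeholder identifiers standing in for the 10,582 training images.
all_ids = [f"img_{k:05d}" for k in range(10582)]
for denominator in (8, 4, 2):
    labeled, unlabeled = split_labeled_unlabeled(all_ids, denominator)
    print(f"1/{denominator}: {len(labeled)} labeled, {len(unlabeled)} unlabeled")
```

Fixing the seed keeps the split reproducible, which matters when several methods are compared under the same protocol.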

Implementation Details
For a fair comparison with previous work, we adopted DeepLabV3+ [34] as the segmentation model, using ResNet [35] as the backbone network. The ResNet backbone network uses ImageNet [36] pre-trained weights as initial weights, and the weights of the segmentation heads are randomly initialized. The experimental results are listed in Tables 1 and 2, which contain all the results of our re-implementation using ResNet-50 as the backbone network.
During the experiments on our model and all the re-implementations, each mini-batch consisted of eight labeled images and eight unlabeled images due to hardware limitations in our environment, and the number of training epochs was set to 80. To train our model with Sync-BN [37], we used the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.0025, a momentum of 0.9, and a weight decay of 0.0001. In addition, we adopted a poly learning rate policy, in which the initial learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{0.9}$ at each iteration. The crop size of the images was 512 × 512, with multi-scale data augmentation applied by randomly selecting a scale from {0.5, 0.75, 1, 1.25, 1.5, 1.75}. Additionally, considering that most SOTA studies have demonstrated performance improvements using CutMix, we applied CutMix after prediction, as used in [8], to our proposed method. We re-implemented previous SOTA methods in our environment. For fairness, their batch size and training epochs were set to 8 and 80, respectively, the same as in our model training, and all other settings follow each original study. All of our experiments were implemented using the PyTorch v1.12.0 framework [38] on servers with two NVIDIA GeForce RTX 3090 GPUs.
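The poly learning rate policy can be written as a short function. The base learning rate of 0.0025 and the power of 0.9 come from the settings above; the value of `max_iter` below is a hypothetical placeholder, since the total number of iterations depends on the dataset split and is not stated.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Learning rate under the poly policy: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1 - iteration / max_iter) ** power


base_lr = 0.0025
max_iter = 10000  # hypothetical total number of training iterations
for it in (0, 2500, 5000, 7500, 10000):
    print(it, poly_lr(base_lr, it, max_iter))
```

The schedule starts at the base learning rate and decays smoothly to zero at the final iteration.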

Evaluation Metrics
Following previous research [7][8][9], we report mean Intersection-over-Union (mIoU) as an evaluation metric based on single-scale inference for all evaluations. In addition, we further check the accuracy and F1-score along with Intersection-over-Union (IoU) to assess pixel-level performance on unlabeled data.

Comparison to SOTA on Different Partition Protocols
To verify the background and foreground discrimination performance of pseudo-labels on unlabeled data, we temporarily used the real labels of the unlabeled data from PASCAL VOC 2012. Table 1 compares the proposed method with the results of converting the predictions of SOTA models into binary predictions on the unlabeled data of PASCAL VOC 2012. It reports the accuracy (Acc), macro-F1 score ($F_1$), background IoU (BG IoU), foreground IoU (FG IoU), and mean IoU (mIoU), and shows that our method is experimentally superior on most metrics. In particular, in the 1/8 experiment with a small labeled ratio, BG IoU increased somewhat while FG IoU increased significantly. This can be interpreted as the result of using more accurate foreground pseudo-labels when little ground-truth information is available. In conclusion, under the 1/8, 1/4, and 1/2 partition protocols, our method outperforms the best mIoU scores of previous studies for background and foreground classification by +1.84%, +0.36%, and +0.003%, respectively. Table 2 shows the mIoU scores on the validation set of PASCAL VOC 2012, comparing our re-implementations of existing SOTA models with the same models trained using the object areas proposed by our model when generating pseudo-labels. In all the experimental results in Table 2, SOTA models combined with our model show better performance than the original models. This is because our model, which has better background classification performance, helps improve the quality of the pseudo-labels used in training the SOTA models.
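The binary evaluation described above can be sketched as follows: multi-class labels are reduced to background versus foreground, and the metrics of Table 1 are computed from the resulting counts. The function name and the toy labels are hypothetical, and pixels carrying an ignore label are assumed to have been removed beforehand.

```python
def binary_metrics(true_labels, pred_labels, background=0):
    """Accuracy, macro-F1, BG IoU, FG IoU and mIoU after reducing labels to BG/FG."""
    t = [0 if v == background else 1 for v in true_labels]
    p = [0 if v == background else 1 for v in pred_labels]
    tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)  # foreground kept as foreground
    tn = sum(1 for a, b in zip(t, p) if a == 0 and b == 0)  # background kept as background
    fp = sum(1 for a, b in zip(t, p) if a == 0 and b == 1)  # background predicted as foreground
    fn = sum(1 for a, b in zip(t, p) if a == 1 and b == 0)  # foreground predicted as background

    def ratio(num, den):
        return num / den if den else 0.0

    fg_iou = ratio(tp, tp + fp + fn)
    bg_iou = ratio(tn, tn + fp + fn)
    f1_fg = ratio(2 * tp, 2 * tp + fp + fn)
    f1_bg = ratio(2 * tn, 2 * tn + fp + fn)
    return {
        "acc": ratio(tp + tn, len(t)),
        "f1": (f1_fg + f1_bg) / 2,
        "bg_iou": bg_iou,
        "fg_iou": fg_iou,
        "miou": (bg_iou + fg_iou) / 2,
    }


# Toy multi-class labels (0 = background); the values are made up.
y_true = [0, 0, 0, 0, 3, 3, 7, 7]
y_pred = [0, 0, 0, 5, 3, 0, 7, 7]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because every foreground class collapses to 1, a prediction of the wrong foreground class still counts as correct here, which isolates the background discrimination ability that this comparison targets.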

Qualitative Results
Figures 5 and 6 visualize the segmentation results for some images from the unlabeled dataset and the validation dataset of PASCAL VOC 2012, respectively. In column (f) of Figure 5, we can see that our model, trained on more accurate pseudo-labels, produces cleaner and more accurate background and foreground classification results than existing SOTA approaches. Furthermore, Figure 6 shows better background discrimination on the validation dataset as well.

Conclusions
We discovered a specific pseudo-label confusion problem that most SOTA studies have in common: in pseudo-labels of foreground objects, mispredictions mostly stem from confusion between the background and the foreground rather than confusion with other foreground classes. This suggests that a high-performance model can be obtained if this confusion problem alone is alleviated in semi-supervised semantic segmentation. To alleviate it, we propose a background and foreground discrimination model using MVIE, which is based on a new output perturbation and a new ensemble method. We experimentally demonstrated the effectiveness of the proposed method. The numerical results show that under the 1/8, 1/4, and 1/2 partition protocols, the mIoU scores for background and foreground classification exceed those of the best existing model by +1.84%, +0.36%, and +0.003%, respectively. Moreover, existing SOTA models trained with the help of our model perform better than the same SOTA models on their own. These models show no increase in inference time, but because our approach consists of multiple networks, the training time and computational cost increase. Therefore, in a future study, we will consider a more efficient approach that alleviates the background confusion problem without a long training time.

Figure 2. Example of the transformation of the semantic ground truth to train all our models when the number of classes is 6.

Figure 3. Overview of our proposed MVIE method. MVIE consists of teacher networks with encoder-decoder structures and a student network with only a decoder structure. There are C-1 teachers, where C is the number of semantic classes, and a single student. The fixed teacher encodes the input image and passes the features to the student as input. Labeled data are input to all teachers and the student and are used to calculate the supervised loss based on cross-entropy. Unlabeled data with weak augmentation are input to all teachers to create predictions, and the pseudo-labels for each teacher's unsupervised loss are produced by applying the new multi-view teacher integrated ensemble to the predictions of all other teachers. The unlabeled data with strong augmentation are additionally input to the fixed teacher, and the encoded features are input to the student to generate the student's predictions for the unlabeled data. The result of applying a hard-voting ensemble to all teacher predictions becomes the pseudo-label for the student's predictions on the unlabeled data and is used to calculate the student's unsupervised loss.

Figure 4. An example in which a multi-view teacher integrated ensemble is applied to the first teacher when the number of classes is 6.

Figure 5. Example of qualitative results on the PASCAL VOC 2012 unlabeled dataset. (a) Input image, (b) binary ground truth, (c) CPS, (d) PS-MT, (e) U²PL, (f) ours. All approaches use DeepLabv3+ with a ResNet-50 backbone as the segmentation network, and the red rectangles highlight false predictions.

Figure 6. Example of qualitative results on the PASCAL VOC 2012 validation dataset. (a) Input image, (b) binary ground truth, (c) CPS, (d) PS-MT, (e) U²PL, (f) ours. All approaches use DeepLabv3+ with a ResNet-50 backbone as the segmentation network, and the red rectangles highlight false predictions.

Table 1. Binary performance comparison with SOTA on the unlabeled data of PASCAL VOC 2012 under different partition protocols. Predictions from state-of-the-art models are converted to binary predictions, and all methods are based on DeepLabv3+. * denotes our re-implementation.

Table 2. Comparison of SOTA models and the same models trained with the help of our model on the PASCAL VOC 2012 val set under different partition protocols. All methods are based on DeepLabv3+, and * denotes our re-implementation.