Semantic Segmentation Using Pixel-Wise Adaptive Label Smoothing via Self-Knowledge Distillation for Limited Labeling Data

To achieve high performance, most deep convolutional neural networks (DCNNs) require a significant amount of training data with ground truth labels. However, creating ground-truth labels for semantic segmentation requires more time, human effort, and cost compared with other tasks such as classification and object detection, because the ground-truth label of every pixel in an image is required. Hence, it is practically demanding to train DCNNs using a limited amount of training data for semantic segmentation. Generally, training DCNNs using a limited amount of data is problematic as it easily results in a decrease in the accuracy of the networks because of overfitting to the training data. Here, we propose a new regularization method called pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation to stably train semantic segmentation networks in a practical situation, in which only a limited amount of training data is available. To mitigate the problem caused by limited training data, our method fully utilizes the internal statistics of pixels within an input image. Consequently, the proposed method generates a pixel-wise aggregated probability distribution using a similarity matrix that encodes the affinities between all pairs of pixels. To further increase the accuracy, we add one-hot encoded distributions with ground-truth labels to these aggregated distributions, and obtain our final soft labels. We demonstrate the effectiveness of our method for the Cityscapes dataset and the Pascal VOC2012 dataset using limited amounts of training data, such as 10%, 30%, 50%, and 100%. Based on various quantitative and qualitative comparisons, our method demonstrates more accurate results compared with previous methods. 
Specifically, for the Cityscapes test set, our method achieved mIoU improvements of 0.076%, 1.848%, 1.137%, and 1.063% for 10%, 30%, 50%, and 100% training data, respectively, compared with the method of the cross-entropy loss using one-hot encoding with ground truth labels.


Introduction
The goal of semantic segmentation is to predict the predefined class (or label) of each pixel, which is fundamental yet challenging in computer vision. Owing to its increasing importance, it is widely adopted in various applications using vision sensors, such as autonomous driving [1,2], 3D reconstruction [3], and medical image analysis [4,5]. In recent years, deep convolutional neural networks (DCNNs) have achieved significant performance improvements and have been the dominant solution for semantic segmentation. Since the introduction of FCNs [6], various architectures have been proposed, including U-Net [4], DeepLab [7][8][9][10], and PSPNet [11].
To achieve high performance, DCNN-based methods typically rely on supervised learning with a significant amount of training data. Creating ground-truth labels for semantic segmentation requires more time, human effort, and cost than other tasks such as classification and object detection, because a ground-truth label is required for every pixel. Hence, there is a practical demand to train DCNNs for semantic segmentation using a limited amount of training data.
Generally, training DCNNs using a limited amount of data is problematic because the networks easily overfit the training data, which reduces their accuracy [12]. Overfitted models generate good results for the training dataset but subpar results for the validation and test datasets, which are not used in training. However, many studies on semantic segmentation have focused mainly on improving accuracy by assuming a significant amount of training data, whereas the problem of insufficient training data has rarely been prioritized.
LS [20,21] generates a smoothed probability vector by adding a one-hot encoding vector using the ground truth and a uniform vector. It enforces the feature from the penultimate layer to be closest to the template of the correct class, while maintaining the same distance as those of the incorrect classes [20]. Hence, the probability generated using LS does not include the correlation information between classes. CP [22] increases the entropy of the prediction probability distribution by subtracting the entropy of the probability from the loss function. It does not include correlations between classes. In addition, it is problematic to further increase the entropy when the entropy of the probability distribution is already large, because this can render the label decision of the pixel ambiguous. KD improves the performance of the student network using the knowledge of the teacher network. However, a good teacher network is typically required to train the student network. Although the methods described above demonstrate good performances, they do not consider the problem of limited training data, and most of them are designed for classification problems, not semantic segmentation.
In this paper, we propose a new regularization method called a pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation to stably train semantic segmentation networks in a practical situation, in which only a limited amount of training data are available. In this regard, we assume that the estimated probability distribution of each pixel exhibits certain relationships and correlations between all pairs of classes [27]. For example, the probabilities of bus and train classes exhibit higher correlations and closer relationships compared with those of bus and sky. Another intuition is that several pixels of the same class exist in an image. Hence, incorrect pixels can benefit from the correct pixels in an image by enforcing consistent distributions between pixels in the same class.
Based on these assumptions, the proposed method generates a pixel-wise adaptive soft label to regularize the estimated probability distribution of each pixel by fully utilizing the internal statistics of the pixels within an input image. Figure 1 shows a schematic flowchart of our method. In this regard, we compute a similarity matrix that encodes the affinities between all pairs of pixels. Based on this matrix, an aggregated probability distribution is computed by adaptively combining the probability distributions of correctly estimated pixels at other positions in an image. Our method compensates for insufficient data using soft labels obtained by aggregating the probabilities of other pixels in an image. However, in the early training steps, the correctly predicted pixels are insufficient. Hence, we adaptively add a uniform distribution to the aggregated distribution as a function of the number of training iterations. As such, in the early steps, the uniform probability has more weight than the aggregated probability; as training progresses, the aggregated distribution receives a larger weight. Although the aggregated distributions facilitate a reduction in the variance error of the estimation, they can result in an increase in the bias error [28]. To reduce both bias and variance errors, we added one-hot encoded distributions with ground-truth labels to these aggregated distributions, which yielded our final soft labels.

Figure 1.
A schematic flowchart of our method. Our method aggregates distributions based on pairwise feature similarity and generates a pixel-wise soft label as the weighted sum of a one-hot encoding with the ground-truth label and the aggregated distribution for each pixel, according to the training iteration. Figure 2 shows the results of our proposed method and the conventional cross-entropy (CE) method [10] for various ratios of limited training data on the Cityscapes dataset [29]. We used the same network as DeepLab-V3+ [10], the same hyperparameters, and the same limited training data to compare these methods. The CE method trained with less data predicts the road, sidewalk, car, and vegetation classes well, but not the bus class. This is because the bus class has fewer pixels in the full training data, and the number of bus-class pixels is further reduced in the limited training data. Therefore, overfitting occurs easily in the CE method owing to the limited training data. By contrast, our proposed method yields more accurate results than the CE method. The contributions of our method are summarized as follows:
• We propose a new probability regularization method for limited training data using a self-knowledge distillation scheme;
• We propose a pixel-wise adaptive label smoothing (PALS) that fully utilizes the internal statistics of pixels within an input image;
• We demonstrate the effectiveness of our method by showing improved accuracy compared with previous methods for various ratios of training data (10%, 30%, 50%, and 100%) on the Cityscapes dataset [29] and the Pascal VOC2012 dataset [30].

Semantic Segmentation
Semantic segmentation is a pixel-wise classification problem that aims to predict the categories of each pixel in a specified image. Various approaches have been proposed to improve the performance of semantic segmentation since the introduction of FCNs [6]. The encoder-decoder architecture [4,31,32] was proposed in early studies to recover spatial losses caused by pooling layers in the networks. Liu et al. [33] and Peng et al. [34] proposed enlarging the receptive field, which is crucial for obtaining context information. In addition to enlarging the receptive field and capturing multiscale context information, refs. [8][9][10][11] proposed pyramid feature pooling methods. To learn semantically richer and spatially more precise feature representations, [35][36][37][38][39][40] combined multiresolution feature maps. Based on the self-attention scheme [41,42], some researchers [43][44][45][46] proposed capturing relational context information by aggregating the relations between pixels.
However, because these studies did not consider situations involving limited training data, which are typically encountered in real-world applications, several researchers have proposed weakly/semi-supervised learning-based methods to address this issue. Refs. [47][48][49][50] used image-level labels, refs. [51][52][53] used bounding boxes, and [54][55][56][57][58][59][60] proposed utilizing unlabeled images. Whereas additional data or annotations are required in the above-mentioned methods, Zhao et al. [61] proposed a pretraining to address the problem of limited data. Specifically, they trained a network twice by pretraining a model based on label-based contrastive learning [62] first, and then fine-tuning the model with cross-entropy loss. Unlike the method described in [61], the proposed method does not require any pretraining.

Regularization
Regularization is a set of techniques that aims to avoid overfitting and improve the generalization of a model. Typical methods to avoid overfitting the training data include L1/L2-regularization [14], dropout [16], batch normalization [15], and data augmentation [63]. Additionally, some researchers have proposed regularizing the output of a model using target modification approaches. LS [20,21] uses soft targets, which are the weighted average of one-hot targets and a uniform distribution over labels. CP [22,23] regularizes the output of a model by penalizing low-entropy output distributions. These methods prevent the model from becoming overconfident [20,22]. Recently, researchers have extended this idea to other tasks, such as domain adaptation [64,65], incremental learning [66], and self-knowledge distillation (self-KD) [24,67,68]. By contrast, this study focuses on training semantic segmentation models using a limited amount of labeled data. Additionally, the proposed method modifies the target distribution by aggregating the probabilities of pixels based on their similarities to the output of the model.
On the other hand, KD [25,69] exploits the predictions of a relatively large teacher model to transfer knowledge to a relatively small student model. Recently, various approaches have been extended to semantic segmentation [70][71][72]. However, because the training process based on teacher-student knowledge distillation requires additional teacher networks, the computational costs are high. In contrast, it has been demonstrated [24,67,68,73,74] that self-KD, which causes the model to learn knowledge from itself, is effective in exploiting the potential capacity of a single model. Although these works are simple and effective, they do not demonstrate their effectiveness in the limited-labeled-data setting for semantic segmentation. Table 1 summarizes the strengths and weaknesses of the various regularization methods described above. Table 1. Strengths and weaknesses of various regularization methods.

Method: LS [20]
Strengths: • It has a positive effect on generalization using the weighted sum of one-hot encoding and the uniform distribution; • It can be applied when a teacher model is not available.
Weaknesses: • The weighting factor for the uniform distribution is fixed and not learnable; • The uniform distribution is not learnable and is not optimal for each pixel.

Method: CP [22]
Strengths: • It has a positive effect on generalization using the entropy term; • It can be applied when a teacher model is not available.
Weaknesses: • The weighting factor for the entropy term is fixed and not learnable; • It may increase the ambiguity of the estimated distribution when the entropy of the distribution is already large.

Method: KD [25]
Strengths: • It has a positive effect on generalization by using the prediction of the teacher network.
Weaknesses: • It cannot be applied when a teacher model is not available.

Method: Ours
Strengths: • It has a positive effect on generalization using the weighted sum of one-hot encoding and the pixel-wise aggregated distribution; • It can be applied when a teacher model is not available.
Weaknesses: • The weighting factor for the pixel-wise aggregated distribution is fixed and not learnable.

Cross Entropy
Since the introduction of FCNs [6], most semantic segmentation networks have been designed using convolutional layers without fully connected layers. The features of the last convolutional layer in a model are known as logits Z ∈ R^(C×H′×W′), where C is the number of classes, and H′ and W′ are the height and width of the logits, respectively. The predicted distribution map P̂ ∈ R^(C×H×W) is then generated from Z, where H and W are the height and width of the original input image. It is noteworthy that when Z differs from P̂ in spatial size, Z is typically resized to the same resolution as P̂. In the typical setting, P̂ is defined using the softmax operation, as follows:

P̂_c(i) = exp(Z_c(i)) / Σ_{c′=1}^{C} exp(Z_{c′}(i)), (1)

where P̂_c(i) denotes the probability of the cth channel of the ith pixel of P̂. Subsequently, the CE loss L_CE is defined as

L_CE = (1/HW) Σ_i H(Y(i), P̂(i)) = −(1/HW) Σ_i Σ_{c=1}^{C} Y_c(i) log P̂_c(i), (2)

where Y ∈ R^(C×H×W) is a one-hot encoded distribution map using ground-truth labels, and Y_c(i) is the value at the cth channel of the ith pixel of Y. H(Y(i), P̂(i)) is the CE of the ith pixel; L_CE is the average CE value over all pixels.
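As a concrete illustration, the per-pixel softmax and the pixel-averaged CE described above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation; the array shapes and the small epsilon guarding log(0) are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the class axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy_loss(logits, labels):
    """Average pixel-wise CE.
    logits: (C, H, W) array; labels: (H, W) integer class indices."""
    C, H, W = logits.shape
    p = softmax(logits, axis=0).reshape(C, -1)       # P_hat flattened to (C, H*W)
    idx = labels.reshape(-1)                         # ground-truth class per pixel
    p_correct = p[idx, np.arange(H * W)]             # probability of the true class
    return float(-np.log(p_correct + 1e-12).mean())  # epsilon guards log(0)
```

For uniform logits over C classes, the loss reduces to log C, the entropy of a uniform prediction.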

Label Smoothing
LS [20] combines a one-hot encoded ground-truth distribution with a uniform distribution to generate a smoothed probability distribution. The smoothed probability distribution map Y_s ∈ R^(C×H×W) is defined as follows:

Y_s(i) = (1 − λ)Y(i) + λU, (3)

where Y_s(i) is the probability distribution vector of the ith pixel of Y_s, U ∈ R^C is a uniform distribution vector in which each element is 1/C, and λ is the weighting factor for the uniform distribution vector. Subsequently, the label smoothing loss L_LS is defined as

L_LS = (1/HW) Σ_i H(Y_s(i), P̂(i)). (4)
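The smoothing step amounts to a one-line array operation; a minimal NumPy sketch (illustrative only, not the authors' code):

```python
import numpy as np

def smooth_labels(labels, num_classes, lam=0.2):
    """Soft label: (1 - lam) * one_hot + lam * uniform.
    labels: (H, W) integer classes -> returns a (C, H, W) soft label map."""
    one_hot = np.eye(num_classes)[labels]   # (H, W, C) one-hot encoding
    one_hot = one_hot.transpose(2, 0, 1)    # move class axis first: (C, H, W)
    return (1.0 - lam) * one_hot + lam / num_classes
```

Each pixel's smoothed distribution still sums to 1, with the true class receiving (1 − λ) + λ/C.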

Confidence Penalty
CP [22] induces an increase in the entropy of the predicted distribution P̂(i). The CP [22] loss L_CP is defined as follows:

L_CP = (1/HW) Σ_i [H(Y(i), P̂(i)) − βH(P̂(i), P̂(i))], (5)

where β is a weighting factor, and H(P̂(i), P̂(i)) represents the entropy of P̂(i).
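A NumPy sketch of the CP loss, assuming natural-log entropy and a fixed β (illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the class axis."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def confidence_penalty_loss(logits, labels, beta=0.1):
    """CE minus beta times the per-pixel prediction entropy, averaged over
    pixels; the entropy term discourages overconfident outputs."""
    C = logits.shape[0]
    p = softmax(logits).reshape(C, -1)
    idx = labels.reshape(-1)
    ce = -np.log(p[idx, np.arange(p.shape[1])] + 1e-12)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=0)
    return float((ce - beta * entropy).mean())
```

When the prediction is already near-uniform, the entropy term is near its maximum log C, which is why further entropy pressure can be counterproductive with limited data.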

Knowledge Distillation
KD transfers the knowledge of a well-trained teacher network to a student network to improve the student network's performance. Typically, the KD loss function L_KD [25] is defined as

L_KD = (1/HW) Σ_i [H(Y(i), P̂(i)) + γKL(P̂_t(i), P̂(i))], (6)

where P̂_t(i) denotes the predicted distribution of the teacher network at the ith pixel, KL(·,·) is the Kullback–Leibler (KL) divergence between the two distributions, and γ is a weighting value for the KL divergence term.
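A minimal NumPy sketch of this loss (the temperature scaling often used in KD is omitted for brevity; this is illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the class axis."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, gamma=1.0):
    """CE with the ground truth plus gamma * KL(teacher || student),
    averaged over all pixels."""
    C = student_logits.shape[0]
    ps = softmax(student_logits).reshape(C, -1)
    pt = softmax(teacher_logits).reshape(C, -1)
    idx = labels.reshape(-1)
    ce = -np.log(ps[idx, np.arange(ps.shape[1])] + 1e-12)
    kl = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=0)
    return float((ce + gamma * kl).mean())
```

When the student matches the teacher exactly, the KL term vanishes and only the CE term remains.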

Proposed Method
In this section, we introduce our pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation for semantic segmentation. We assume that only a small amount of training data is available to train the network. For each input image, various pixels share the same class, because one object comprises several pixels, and multiple objects may exist in an input image. Hence, our method generates a pixel-wise adaptive soft label for each pixel by aggregating the probability distributions of correctly estimated pixels of the same class. The soft labels function as a teacher in regularizing the distributions of each pixel. Figure 3 shows an overview of our proposed method, which is categorized into training and test paths. Let an input image be I ∈ R^(3×H×W), where H is the height, W is the width, and the number of color channels is three. In the training path, to improve the network performance, we generate an adaptive soft label map P ∈ R^(C×H×W) using the proposed PALS module, where C is the number of classes. The structure of the PALS module is explained in detail in the following subsection. By comparing P and the estimated probability distribution P̂ ∈ R^(C×H×W), we compute a loss for training the network. In the test path, we predict our result using only the probability distribution P̂.

PALS Module
Figure 4 illustrates our PALS module. The input features of the module are the logits Z ∈ R^(C×H′×W′) and the penultimate feature map E ∈ R^(K×H′×W′), where K is the number of channels of the penultimate feature map, and H′ and W′ are the spatial sizes. To compute a similarity matrix S ∈ R^(H′W′×H′W′) that contains the similarities or correlations between all pairs of features in E, we perform matrix multiplication using the reshaped matrices E_R ∈ R^(K×H′W′) and E_R^T ∈ R^(H′W′×K) obtained from E. Therefore, S is defined as

S = E_R^T ⊗ E_R = [s_1, s_2, ..., s_{H′W′}], (7)

where s_i ∈ R^(H′W′×1) is a column vector that includes all correlations between the feature of the ith spatial position and all features in E. To normalize each column vector, we perform a softmax operation along each column axis. Subsequently, S_norm is defined as

S_norm = [ρ(s_1), ρ(s_2), ..., ρ(s_{H′W′})], (8)

where ρ(·) represents the softmax operation.
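The construction of the similarity matrix and its column-wise normalization can be sketched as follows (NumPy, with illustrative shapes; not the authors' implementation):

```python
import numpy as np

def similarity_matrix(E):
    """E: penultimate features of shape (K, Hp, Wp).
    Returns S_norm of shape (N, N) with N = Hp * Wp, where column i is a
    softmax over the dot-product affinities of pixel i to every pixel."""
    K = E.shape[0]
    E_r = E.reshape(K, -1)                 # E_R: (K, N)
    S = E_r.T @ E_r                        # pairwise affinities: (N, N)
    S = S - S.max(axis=0, keepdims=True)   # stabilize the column-wise softmax
    e = np.exp(S)
    return e / e.sum(axis=0, keepdims=True)
```

After normalization, each column is a probability distribution over source pixels, which makes the subsequent aggregation a convex combination of pixel distributions.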
To compensate for insufficient training data, we fully utilize the internal statistics of the pixels in the input image. In this regard, we compute a pixel-wise ensemble of distributions by adaptively aggregating the distributions of other pixels based on the pixel affinity. Thus, we generate an aggregated distribution map Q′ ∈ R^(C×H′W′) from the proposed probability aggregation (PA) module, which exploits the information of correctly estimated pixels of the same class in an input image. Figure 5 shows the process of the PA module in detail. To compute Q′, we generate a set of class masks A = {A_c ∈ R^(C×H′W′)}_{c=1,2,...,C} and a correct mask B. To generate a class mask A_c that corresponds to the cth class, we create a binary mask M_c ∈ R^(H′×W′) for the cth class using a downsampled ground-truth image. An element of M_c at each spatial position has a value of 1 when the ground-truth label corresponds to class c, and 0 otherwise. Furthermore, we reshape M_c to generate a one-dimensional vector φ_c ∈ R^(1×H′W′). Subsequently, we concatenate the φ_c vector C times along the column axis to generate A_c. Meanwhile, to create the correct mask B ∈ R^(C×H′W′), we generate a binary map V ∈ R^(H′×W′), where each element of V is 1 when the predicted label using Z is correct, and 0 otherwise. We reshape V to generate ψ ∈ R^(1×H′W′). Subsequently, the correct mask B is obtained by concatenating the ψ vector C times along the column axis. Subsequently, Q′ and Q ∈ R^(C×H×W) are defined as

Q′ = Σ_{c=1}^{C} A_c ⊙ ((A_c ⊙ B ⊙ X) ⊗ S_norm),  Q = ↑(Q′), (9)

where ⊙ is an element-wise multiplication operation, and ⊗ is a matrix multiplication operation. ↑(·) is an upsampling operation that uses bilinear interpolation. X ∈ R^(C×H′W′) is a probability distribution map obtained by performing the softmax operation along the channel axis for each pixel of Z and then reshaping it. Q is the upsampled result of Q′.
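The masking-and-aggregation step can be sketched with explicit loops for clarity. This is a hypothetical reading of the PA module, assuming the aggregation weights for each target pixel are renormalized over the surviving (correct, same-class) source pixels; the paper's exact normalization may differ:

```python
import numpy as np

def aggregate_distributions(X, S_norm, gt_labels, pred_labels):
    """Sketch of the PA module at the downsampled resolution.
    X: (C, N) per-pixel softmax distributions; S_norm: (N, N) column-normalized
    affinities; gt_labels / pred_labels: (N,) ground-truth and predicted classes.
    For each target pixel i, average the distributions of source pixels j that
    are correctly predicted (mask B) and share pixel i's ground-truth class
    (mask A), weighted by the affinity S_norm[j, i]."""
    C, N = X.shape
    correct = (gt_labels == pred_labels).astype(float)   # correct mask B
    Q = np.zeros_like(X)
    for i in range(N):
        w = S_norm[:, i] * correct * (gt_labels == gt_labels[i])
        if w.sum() > 0:                                  # renormalize (assumed)
            Q[:, i] = (X * w).sum(axis=1) / w.sum()
    return Q
```

Because each aggregated column is a convex combination of probability columns, the output columns remain valid distributions.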
It is noteworthy that the aggregated distribution Q at an early iteration is not sufficiently accurate, because only a few pixels are correct in the early iterations. Hence, we adaptively combine Q and the uniform distribution U as a function of the current iteration number τ. Subsequently, the fused probability distribution map P̄_τ ∈ R^(C×H×W) at iteration τ is defined as

P̄_τ(i) = (1 − ε)U + εQ(i),  ε = τ/T, (10)

where P̄_τ(i) is the distribution vector at the ith pixel in P̄_τ, U is a uniform distribution vector in which each element is 1/C, T is the total number of iterations, and τ is the current iteration number. Here, ε represents the ratio of the current iteration τ to the total iterations T, similar to [75]. Generally, aggregated distributions and one-hot encoded distributions exhibit different properties: the former reduces the variance error, whereas the latter reduces the bias error [28]. Therefore, we combine P̄_τ and a one-hot encoded distribution map Y ∈ R^(C×H×W) to reap the advantages of both, generating the final soft label P_τ ∈ R^(C×H×W), called a pixel-wise adaptive label smoothing (PALS). Here, P_τ(i), the probability distribution vector of the ith pixel in P_τ at iteration τ, is defined as

P_τ(i) = (1 − α)Y(i) + αP̄_τ(i), (11)

where Y(i) is the one-hot vector at the ith pixel in Y, and α is the weighting factor between the two vectors P̄_τ(i) and Y(i). It is noteworthy that at the initial iteration, where ε is 0, P_τ(i) is the same as Y_s(i) in Equation (3) with λ = α. As iteration progresses, ε increases up to 1, and the uniform distribution U in Equation (3) is replaced with the pixel-wise aggregated probability Q(i) in Equation (9).
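The two-stage mixing described above — fusing the uniform and aggregated distributions with a weight ε = τ/T, then blending the result with the one-hot label via α — can be sketched for a single pixel (a reconstruction under the stated assumptions, not the authors' code):

```python
import numpy as np

def pals_soft_label(Q_i, Y_i, tau, T, alpha=0.2):
    """eps = tau / T; fused = (1 - eps) * U + eps * Q_i;
    soft label = (1 - alpha) * Y_i + alpha * fused.
    At tau = 0 this reduces to label smoothing with lambda = alpha."""
    C = Y_i.shape[0]
    eps = tau / T
    U = np.full(C, 1.0 / C)                 # uniform distribution over classes
    fused = (1.0 - eps) * U + eps * Q_i     # uniform -> aggregated over training
    return (1.0 - alpha) * Y_i + alpha * fused
```

As training progresses, the soft label smoothly shifts its non-ground-truth mass from a uniform spread toward the class correlations captured by the aggregated distribution.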

Loss Function
The loss function L_PALS for training the network is defined as

L_PALS = (1/HW) Σ_i KL(P_τ(i), P̂_τ(i)), (12)

where P_τ(i) and P̂_τ(i) are the proposed soft target defined in Equation (11) and the predicted distribution of the ith pixel at iteration τ, respectively. We compute our loss function using the KL divergence between the two distributions.
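The per-pixel KL divergence averaged over the image can be sketched as (NumPy, illustrative only):

```python
import numpy as np

def pals_loss(P_soft, P_pred):
    """Mean over pixels of KL(P_soft || P_pred); both arguments are (C, N)
    arrays whose columns are probability distributions over classes."""
    kl = (P_soft * (np.log(P_soft + 1e-12) - np.log(P_pred + 1e-12))).sum(axis=0)
    return float(kl.mean())
```

The loss is zero when the prediction matches the soft target exactly and positive otherwise.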

Experiments
In this section, we compare our proposed method with previous methods and analyze the effectiveness of our proposed method based on various experimental settings. Further details are provided in the following subsections.

Dataset
To perform evaluations, we used the Cityscapes [29] dataset and the Pascal VOC2012 [30] dataset for semantic segmentation. The Cityscapes dataset includes urban scenes for semantic segmentation, and it contains 30 classes; however, we used only 19 classes for training and testing, similar to previous studies [9][10][11]. Each image exhibited a high resolution of 2048 × 1024. The dataset contains 5000 pixel-level finely annotated images and 20,000 coarsely annotated images. In the finely annotated images, 2975/500/1525 images are allocated for training, validation, and testing, respectively. We used only finely annotated images for training. The Pascal VOC2012 dataset [30] is one of the most competitive semantic segmentation datasets. It contains 21 classes, including 20 foreground classes and 1 background class. This dataset consists of 10,582 training, 1449 validation, and 1456 test images.

Implementation Details
Our method was applied to the DeepLab-V3+ [10] model, with Xception65 [76] and ResNet18 [77] as backbone networks. The former is a deeper and heavier network than the latter. We initialized the backbone networks using weights pretrained on the ImageNet [78] dataset, whereas the weights of the other modules, such as the ASPP module [10], were randomly initialized. To train the networks, we set the initial learning rate to 0.02 and used a polynomial learning rate scheduler with factor (1 − τ/T)^0.9 and SGD optimization. For unbiased comparisons, we used the same hyperparameters, including a batch size of 8 and 200 epochs, for all the experiments. For the Cityscapes dataset, to evaluate the accuracy of the networks for a limited amount of training data, we randomly selected 10%, 30%, 50%, and 100% of the images from the original training dataset, where each proportion comprises 297, 894, 1487, and 2975 training images, respectively. For data augmentation, we performed random horizontal flipping and random-scale cropping. The random scale range was (0.5, 2.0), and the cropping size was 384 × 384. During training, half-size images were used to reduce memory consumption; full-size images were used for the validation and test data after the results were upsampled. A suitable weighting factor α in Equation (11) was determined empirically, as described in the ablation study. For the Pascal VOC2012 dataset [30], we set mostly the same parameters as those for the Cityscapes dataset, except for the cropping size and the number of training epochs. Our method was applied to DeepLab-V3+ [10] with Xception65 [76] as the backbone network for the Pascal VOC2012 dataset. The cropping size was set to 480 × 480, and the number of training epochs was 100. We randomly selected 10%, 30%, 50%, and 100% of the images from the original training set, where each proportion comprises 1059, 3175, 5291, and 10,582 training images, respectively.
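The polynomial learning rate schedule mentioned above amounts to the following one-liner (a minimal sketch of the standard "poly" policy, not the authors' training script):

```python
def poly_lr(base_lr, tau, T, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - tau / T) ** power,
    where tau is the current iteration and T the total number of iterations."""
    return base_lr * (1.0 - tau / T) ** power
```

With base_lr = 0.02 and power = 0.9, the rate decays monotonically from 0.02 to 0 over the course of training.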

Comparison with Previous Methods
We compared our proposed method with previous methods, including CE [10], LS [20], and CP [22]. For LS [20] and CP [22], we empirically determined the values of λ in Equation (3) and β in Equation (5); the best performance was achieved using λ = 0.2 and β = 0.1. For unbiased comparisons, we used the same limited training data for all the comparative methods. Table 2 lists the mIoU results on the Cityscapes training, validation, and test data for DeepLab-V3+ with the Xception65 network. Each column represents the data ratio used for training, and each row represents a different method. Each method was trained three times, and the average mIoU values and corresponding variances are listed in Table 2. It is noteworthy that all methods suffer from overfitting when the amount of training data is sufficiently small: the accuracy on the training data is favorable, whereas that on the validation and test data decreases significantly. The results for the validation data show that our method yielded the best accuracy, except for the results based on only 10% of the data. Meanwhile, based on the results for the test data, our method shows the best accuracy for all data ratios compared with the other methods. LS generates soft labels by adding a uniform distribution to a one-hot vector, which results in better accuracy than the baseline CE method. However, LS is suboptimal as a regularization function because it does not consider the correlation between classes. CP regularizes the distribution by subtracting its entropy. CP performs worse as the amount of training data decreases, because the entropy of the distribution is already large, particularly when the training data are limited. Figure 6 shows the qualitative comparison results of different methods for DeepLab-V3+ [10] with the Xception65 [76] network on the validation data.
The ratio numbers in the first column of Figure 6 denote the data ratios used for training from the original training set. It is observed that our method generates results with more accurate boundary regions and less noise in homogeneous regions of the train, truck, and bus objects compared with the other methods. For the 10% training data, the results of most methods include severe errors and ambiguous boundaries for the pole and bus classes, which contain fewer labeled pixels than the other classes. By contrast, our method yields more accurate and clearer boundaries for those classes. Similarly, for the 30% training data, our method yields less noise at the object boundaries of cars and buildings compared with the other methods. For the 50% training data, our method predicts the boundaries of trucks more clearly than the other methods. For the 100% training data, our method yields predictions that are better than those of LS [20] and CP [22] for the bus objects, and better than that of CE [10] for the buildings. Table 3 shows the mIoU results of various methods for DeepLab-V3+ [10] with the ResNet18 backbone [77], which is a lighter network than DeepLab-V3+ [10] with the Xception65 backbone [76]. Table 3 shows that our method achieves the best accuracy, except for the 100% validation data and the 50% test data. As shown in Table 3, LS performs better than CE in most cases, whereas CP [22] is less accurate than CE [10] for the 10% and 50% validation and test data, respectively. This is because a light network typically exhibits lower confidence in terms of probability distribution compared with a heavy network [79]. Therefore, CP [22] results in reduced accuracy because it enlarges the entropy of the probability distributions. Figure 7 shows the qualitative comparison results of different methods using DeepLab-V3+ [10] with the ResNet18 [77] network on the validation data.
Because a light network was used, these results indicate less accurate performance than the heavy network. However, our results show clearer boundaries and less noise compared with the other methods. For the 10% training data, our method yielded better predictions than the other methods for rider objects. For the 30% and 50% training data, our method yielded more accurate results, particularly for truck objects, compared with the other methods. For the 100% training data, our method yielded better predictions for train objects compared with the other methods. Table 4 shows the mIoU results of various methods for DeepLab-V3+ [10] with the Xception65 network [76] on the Pascal VOC2012 dataset. It is observed that our method achieves the best accuracy for all cases in the validation and test data. Specifically, our method achieved mIoU improvements of 1.447%, 0.713%, 3.185%, and 1.153% for 10%, 30%, 50%, and 100% training data, respectively, compared with the baseline method. LS performs better than CE for all cases, whereas CP [22] is less accurate than CE [10], except when using 100% training data. Figure 8 shows the qualitative comparison results of different methods using DeepLab-V3+ [10] with the Xception65 [76] network on the validation data. Because our method aggregates distributions using correctly estimated pixels based on the pairwise feature similarity, the objects in our results have more accurate boundaries and less noise compared with other methods that estimate each pixel independently. For the 10% training data, for example, the predicted result for the bird class of our method is more accurate than the others. For the 30% and 50% training data, most other methods incorrectly predict the dog class and the table class, respectively; in contrast, our method correctly estimates them. For the 100% training data, our method yields better predictions for a complex table class consisting of a multitude of small objects compared with other methods.

Ablation Study
As introduced in Section 4, our proposed method generates a pixel-wise adaptive soft label P in Equation (11) and uses it to define a loss function. When generating P, we used multiple components, including the same-class masks A, the correct mask B, a uniform distribution U, and an adaptive weight ε. To investigate the effectiveness of our proposed method, we conducted experiments where each component was removed from our original method. Table 5 shows the mIoU values obtained when each component was removed from our original method. The first row shows the mIoU results of our original method, defined by Equation (11). Here, "without class mask A" represents a method where the mask A is removed in Equation (9) by setting all elements in A to 1. Furthermore, "without correct mask B" represents a method where the mask B is removed in Equation (9) by setting all the elements in B to 1, and "without uniform distribution U" represents a method where the uniform distribution U is removed from Equation (10) by setting the value of ε to 1 for all the iterations. Lastly, "without adaptive weight ε" represents a method that fixes the weight ε to 0.5 in Equation (10) for all the iterations instead of using it in an adaptive manner. By comparing the results presented in the first and second rows in Table 5, the effectiveness of using A was observed. When pixels of different classes participated in the computation of Q, the probability distributions of the pixels were mixed with those of the other classes, which resulted in less accurate results. By comparing the results of the first and third rows in Table 5, the effectiveness of using B was observed. When the incorrectly estimated pixels participated in the computation of Q, the erroneous probability distributions of the pixels contaminated the final soft targets, which resulted in less accurate results. 
By comparing the first and fourth row results in Table 5, the effectiveness of using the uniform distribution U was observed. In the early iteration of the training step, the aggregated distribution Q was inaccurate because many pixels were incorrect pixels. Hence, a uniform distribution U is more beneficial than Q in the early iterations. The effectiveness of the adaptive weight ε was investigated by comparing the first and last rows in Table 5. If we fix the value of ε to 0.5, then the weights of Q and U will be the same for all iterations. Because Q contains reliable information in the later iterations, it cannot fully function adaptively when ε is fixed.
On the other hand, to investigate the effect of varying α in Equation (11), we performed several experiments by changing α values to {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5}. Table 6 shows the mIoU values of the proposed method on the Cityscapes validation data using DeepLab-V3+ [10] with the Xception65 network [76] as a function of the α values. It is observed that our method generates similar performances for various α values for most cases. When α is 0.2, our method generates the best performance, except for the 50% dataset case. Based on these results, we fixed α at 0.2 for all the comparative evaluations. Table 6. mIoU values of our method by varying α values on the Cityscapes validation data. The value of α represents the weighting factor in Equation (11). Bold expressions indicate the best accuracy.

Conclusions
We have proposed a pixel-wise adaptive label smoothing (PALS) method via self-knowledge distillation to train semantic segmentation networks with limited training data. In this regard, we aggregated the distribution of each pixel to fully utilize the redundant information in an image by computing a similarity matrix that encodes the correlations between pairs of pixels. Based on the similarity matrix, we proposed a soft label that progressively combines a one-hot encoded label and the aggregated distribution for each pixel as a function of the training iteration. Our method yielded the most accurate results for various ratios of limited training data on the Cityscapes dataset and the Pascal VOC2012 dataset compared with previous regularization methods, using DeepLab-V3+ with the Xception65 and ResNet18 networks.