Article

Relevancy between Objects Based on Common Sense for Semantic Segmentation

Jun Zhou, Xing Bai and Qin Zhang
1 Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(24), 12711; https://doi.org/10.3390/app122412711
Submission received: 1 November 2022 / Revised: 25 November 2022 / Accepted: 7 December 2022 / Published: 11 December 2022

Abstract

Research on image classification sparked the latest deep-learning boom, and many downstream tasks, including semantic segmentation, benefit from it. The state-of-the-art semantic segmentation models are all based on deep learning, and they sometimes make semantic mistakes. In a semantic segmentation dataset with a small number of categories, images are often collected from a single scene, and there is a close semantic connection between any two categories. However, in a semantic segmentation dataset collected from multiple scenes, two categories may be irrelevant to each other. The probability that objects of one category appear next to objects of another category differs across category pairs, which is the basis of this paper. Semantic segmentation methods need to solve two problems: localization and classification. This paper is dedicated to correcting those clearly wrong classifications that are contrary to reality. Specifically, we first calculate the relevancy between different class pairs. Then, based on this knowledge, we infer the category of a connected component according to its relationships with the surrounding connected components and correct the obviously wrong classifications made by a deep learning semantic segmentation model. Several well-performing deep learning models are evaluated on two challenging public datasets in the field of semantic image segmentation. Our proposed method improves the performance of UPerNet, OCRNet and SETR from 40.7%, 43% and 48.64% to 42.07%, 44.09% and 49.09% mean IoU on the ADE20K validation set, and the performance of PSPNet, DeepLabV3 and OCRNet from 37.26%, 37.3% and 39.5% to 38.93%, 38.95% and 40.63% mean IoU on the COCO-Stuff dataset, which shows the effectiveness of the method.

1. Introduction

Semantic segmentation is a fundamental task in computer vision and plays an important part in many scene understanding problems. A semantic segmentation model needs to predict the category of each pixel in an image. In the process of inference, two problems need to be solved: localization and classification.
Semantic segmentation is closely related to image classification; as a fundamental task in computer vision, it has been covered by many surveys [1]. Most recent semantic segmentation methods follow the fully convolutional network (FCN) [2] and equip it with an encoder-decoder architecture. By removing fully connected layers, a fully convolutional network can handle images of any size and produce pixel-wise predictions. The encoder of an encoder-decoder semantic segmentation model is usually borrowed from image classification [3,4,5,6,7]: it consists of stacked convolution layers that progressively reduce the spatial resolution of the feature maps and enlarge the receptive fields of the convolution kernels as much as possible to obtain more abstract, semantic information. The decoder recovers the spatial resolution of the feature representations produced by the encoder for pixel-level classification. Feature representation learning is the most important model component and is usually carried out by a backbone pretrained on ImageNet. CNNs are known for extracting local features; after the fully connected layers are removed, each element of the feature map output by the fully convolutional network lacks long-range dependency information, which is crucial for semantic segmentation, and this leads to coarse results.
To overcome the lack of multi-scale information, many works enlarged the receptive fields of convolution operations. DeepLab [8,9,10,11] introduced atrous convolution and developed the ASPP module, which adopts atrous convolutions with different dilation rates. PSPNet [12] proposed the PPM module to obtain contextual information at different scales and built a feature pyramid. To balance the trade-off between semantic and spatial information, many works [13] aggregated the feature maps from different stages of the backbone, and multiple variants of the encoder-decoder architecture have been developed.
However, the aforementioned methods still do not know which part of the context information is pivotal for segmentation. To address this issue, some works integrate attention modules into FCN-based architectures. PSANet [14] proposed a point-wise spatial attention module for capturing long-range contextual information. DANet [15] used two attention modules to model long-range dependencies along the spatial and channel dimensions, respectively. CCNet [16] introduced an attention module with relatively low computational complexity. OCNet [17] attached attention modules to each branch of the PPM module or the ASPP module.
The aforementioned models are still FCN-based models decorated with attention modules. More recently, the success of the transformer in natural language processing has inspired researchers to treat computer vision tasks as sequence-to-sequence tasks and to build pure transformer models for them, usually deploying a transformer [18] to encode an image as a sequence of patches. ViT [19] was the first work built with a pure transformer for image classification and outperformed CNN-based architectures. Following image classification, semantic segmentation has also been regarded as a sequence-to-sequence prediction task. The transformer is proficient at modeling long-range dependencies in the data; SETR [20], the Swin transformer [21] and BEiT [22] took full advantage of this and significantly surpassed FCN-based models.
State-of-the-art models are computationally expensive. Many works focus on speeding up the inference of their models. ERFNet [23] reduced the resolution of the input image. BiSeNet [24] introduced a fast downsampling context path to extract semantic information and a simple spatial path to preserve spatial information. FastFCN [25] proposed the joint pyramid upsampling module to extract high-resolution feature maps.
In this paper, we aim to provide a post-processing method for the semantic segmentation task; it is an attempt at and exploration of a new idea. First, we count the frequency of paired occurrences of adjacent connected components in the training set and build a relevancy matrix. The class of the central connected component is highly correlated with the categories of the surrounding regions, and we obtain prior probabilities from the relevancy matrix. Then, we propose two post-processing methods for the semantic image segmentation task; based on the average confidence of the central connected component, we apply the appropriate one. Recently popular deep learning methods focus on improving the prediction model itself; our innovation is instead to propose two new post-processing methods that improve the original results of the prediction model. Our proposed post-processing method improves the mean IoU of UPerNet50 [26], OCRNet [27] and SETR-MLA on the ADE20K [28] validation set with single-scale inference from 40.7%, 43% and 48.64% to 42.07%, 44.09% and 49.09%, and the mean IoU of PSPNet, DeepLabV3 and OCRNet on the COCO-Stuff [29] validation set with single-scale inference from 37.26%, 37.3% and 39.5% to 38.93%, 38.95% and 40.63%.
In practical applications such as autonomous driving, most images are street views that contain few categories, so the proposed post-processing has little effect there. In application scenarios such as unmanned aerial vehicle navigation, however, many objective factors, such as lighting, shooting angle and weather, cause the captured images to be partially blurred or to contain a large number of categories. For processing such complex images, the post-processing method is very useful.

2. Method

In this section, two post-processing methods for deep learning-based semantic segmentation models are introduced.
There are usually many objects in an image in complex application scenarios. The categories of objects that appear in images taken from similar scenes are highly correlated. For example, there are beds, pillows and bedside tables in a bedroom and tables and cutlery in a dining room. On roads, we can see cars, buildings and traffic signs. Under normal circumstances, it is impossible for cars, traffic signs, grass, etc., to appear around a bed, whereas there is a high probability that pillows and quilts will appear on a bed. Therefore, the relationship between two classes may be close or distant. This is not significant for datasets such as Cityscapes [30] that focus on a single scene. However, for datasets such as ADE20K that are collected from multiple scenes, the relevancy between classes A and B may differ greatly from the relevancy between classes A and C.
Each annotation in a semantic segmentation dataset is a matrix in which each element is a number representing a category. Inside a thing or a patch of stuff, the region is filled with the same number, which makes it a connected component. As shown in Figure 1, there are other connected components around each thing or stuff region. As mentioned before, connected components of class B may often appear around connected components of class A, while connected components of class C may not. Therefore, we count the frequency with which any two adjacent connected components appear together in the annotations of the training set to measure the distance between any two classes. We denote the frequency of objects of class j appearing adjacent to objects of class i as $f_{ij}$. Frequencies between some categories are shown in Table 1. If region S is adjacent to region C, then we approximate the probability that C belongs to category j, given that S belongs to category i, as $f_{ij}$:
$$P(C = j \mid S = i) = f_{ij}$$
Consistent with common sense, the class most closely related to wall is floor, and the class least related to wall is lake: lake never appears next to wall in the training set, whereas ceiling and wall share a very close connection. The relevancy between many class pairs is 0, such as building and bed, or road and cabinet. Therefore, if a model infers a region of an image to be road and this is true, then it is probably wrong for a region surrounded by road to be inferred as cabinet. This paper is inspired by such common-sense observations.
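As an illustration, a minimal sketch of how such a relevancy matrix could be computed from the training annotations is given below. It assumes each annotation is a 2-D integer label map and uses SciPy connected-component labeling; names such as NUM_CLASSES, IGNORE_LABEL, adjacency_counts and build_relevancy_matrix are illustrative and not taken from our implementation.

```python
# A hedged sketch: for every class i, count the fraction of its connected
# components that are adjacent to at least one component of class j.
import numpy as np
from scipy import ndimage

NUM_CLASSES = 150        # e.g., ADE20K (illustrative)
IGNORE_LABEL = 255       # pixels to skip (illustrative)

def adjacency_counts(label_map, num_classes=NUM_CLASSES):
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    comp_totals = np.zeros(num_classes, dtype=np.int64)
    for cls in np.unique(label_map):
        if cls == IGNORE_LABEL or cls >= num_classes:
            continue
        components, n = ndimage.label(label_map == cls)
        comp_totals[cls] += n
        for comp_id in range(1, n + 1):
            mask = components == comp_id
            # One-pixel ring around the component: the adjacent region.
            ring = ndimage.binary_dilation(mask) & ~mask
            for neighbor in np.unique(label_map[ring]):
                if neighbor != IGNORE_LABEL and neighbor < num_classes:
                    counts[cls, neighbor] += 1
    return counts, comp_totals

def build_relevancy_matrix(annotations):
    counts = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
    totals = np.zeros(NUM_CLASSES, dtype=np.int64)
    for ann in annotations:
        c, t = adjacency_counts(ann)
        counts += c
        totals += t
    # f[i, j]: proportion of class-i components adjacent to class-j components.
    return counts / np.maximum(totals, 1)[:, None]
```

The row-wise normalization matches the reading of Table 1: each entry is the proportion of class-i connected components that touch at least one component of class j.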
In the semantic segmentation task, a model needs to localize and classify. In popular deep learning methods, various network architectures are designed to extract features. Regardless of the differences between network structures, all of these networks generate feature maps with the same number of channels as the number of categories. The feature maps are then fed into a softmax layer, and the class with the highest confidence is chosen as the prediction for each pixel. The wrong inferences mentioned earlier are caused by bad classifications made by the model. In this paper, we only address classification, not localization. To correct the wrong classifications that deep learning models may make, we propose two post-processing methods. For each connected component, the categories of all connected components around it influence the classification of that region, so our proposed post-processing methods operate on connected components. As shown in Figure 2, we denote the central connected component as C and the i-th surrounding connected component as $S_i$.
We first calculate the average confidence for each class over all pixels in region C:
$$avg\_conf_k = \frac{\sum_{i \in C} conf_{ik}}{n}$$
where $conf_{ik}$ is the predicted confidence that pixel i belongs to class k and n is the number of pixels in region C. The class with the highest average confidence is the class that the model predicts for region C:
$$l_C = \arg\max_{k \in \{1, \dots, K\}} avg\_conf_k$$
where K is the number of categories in the dataset. If the highest average confidence of a connected component is below a predefined threshold, the model's prediction for this region is not confident enough. These are the connected components that we think need to be processed.
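To make the selection of low-confidence components concrete, the following sketch computes the per-component average confidence from a model's softmax output. It assumes probs has shape (K, H, W) and pred is the per-pixel argmax label map; the function names and the default threshold are illustrative.

```python
# A hedged sketch of the per-component confidence check.
import numpy as np
from scipy import ndimage

def component_avg_confidence(probs, mask):
    # avg_conf_k: mean of conf_{ik} over the pixels i of region C (given by mask).
    return probs[:, mask].mean(axis=1)          # shape (K,)

def low_confidence_components(probs, pred, threshold=0.35):
    """Yield (mask, avg_conf) for components whose top average confidence is low."""
    for cls in np.unique(pred):
        components, n = ndimage.label(pred == cls)
        for comp_id in range(1, n + 1):
            mask = components == comp_id
            avg_conf = component_avg_confidence(probs, mask)
            if avg_conf.max() < threshold:
                yield mask, avg_conf
```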
The following describes the first post-processing method. We use the frequencies as the relevancy matrix over all classes; each category in S has a relevancy vector with respect to the other classes. We sum the relevancy vectors of all adjacent connected components to obtain the reference $r_C$ for the central connected component of interest:
$$r_C = \sum_{j \in S} f_j$$
where $f_j$ is the frequency vector of the j-th surrounding class with respect to the other classes, and S is the set of classes to which the connected components around the central component belong. The highest-scored class in the reference is the class to which the region of the image most likely belongs, and the lowest-scored class is the opposite. If we denote the k-th highest score in $r_C$ as $r_k$, then the set of the top-k scored classes, denoted $\mathcal{K}$, can be expressed as:
$$\mathcal{K} = \{\, i \mid r_i \ge r_k \,\}$$
We think that the right class for region C is in $\mathcal{K}$. Therefore, we keep the top-k scored classes and discard the rest. If the class that the deep learning semantic segmentation model predicted for region C is not in $\mathcal{K}$, then we change the classification result and choose the class with the highest average confidence among the top-k scored classes as the classification result for the region. To this end, we set the average confidence of classes not in $\mathcal{K}$ to 0:
$$avg\_conf_i = 0, \quad i \notin \mathcal{K}$$
At this point, the class with the highest average confidence in $avg\_conf$ is the label that we think region C belongs to.
The category predicted by the model is the same for every pixel in region C. However, after we discard the categories not in $\mathcal{K}$, the category with the highest confidence is no longer the same for every pixel in region C. If we applied the post-processing to each pixel of the central region without considering its integrity, it would split the connected component. Therefore, we select the category with the highest average confidence over all pixels in the region as the result.
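A minimal sketch of the first method is given below, reusing the relevancy matrix and per-component average confidences sketched above; top_k and the helper names are illustrative choices rather than the exact settings used in our experiments.

```python
# A hedged sketch of the first post-processing method.
import numpy as np
from scipy import ndimage

def surrounding_classes(pred, mask):
    """Classes of the connected components adjacent to the region given by mask."""
    ring = ndimage.binary_dilation(mask) & ~mask
    return np.unique(pred[ring])

def first_method(avg_conf, neighbor_classes, relevancy, top_k=5):
    # r_C: sum of the relevancy vectors of the surrounding classes.
    reference = relevancy[neighbor_classes].sum(axis=0)
    keep = np.argsort(reference)[-top_k:]      # the top-k scored classes
    restricted = np.zeros_like(avg_conf)
    restricted[keep] = avg_conf[keep]          # discard classes outside the set
    return int(restricted.argmax())            # corrected label for region C
```

Restricting the candidates before taking the argmax over the averaged confidences, rather than per pixel, keeps the connected component intact, as discussed above.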
The following describes the second post-processing method. Suppose the number of connected components around the central connected component is n. If objects of class A never appear next to objects of any of these n categories in the training images, then the probability that the central connected component belongs to class A is approximately 0. However, this condition is obviously too strong, as few connected components meet it. Therefore, we weaken it: if, among the n surrounding categories, there are more than k categories whose relevancy with class A is 0, then we believe that the category of the central connected component cannot be class A. In addition, we treat frequencies close to 0 as 0. We set the average confidence of each class that satisfies this condition to 0, which means we discard the class:
$$avg\_conf_i = 0, \; i \in \{1, \dots, N\}, \quad \text{if} \quad \left|\{\, j \in S \mid f_{ji} = 0 \,\}\right| > k, \quad |S| = n$$
where N is the number of categories annotated in the dataset. Similar to the first post-processing method, we select the category with the highest average confidence among the remaining classes as the classification result of the central region.
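A corresponding sketch of the second method is shown below; k and eps (the value below which a frequency is treated as zero) are illustrative.

```python
# A hedged sketch of the second post-processing method.
import numpy as np

def second_method(avg_conf, neighbor_classes, relevancy, k=2, eps=1e-4):
    # For each candidate class i, count the neighbors j with f_{ji} close to 0.
    zero_links = (relevancy[neighbor_classes] < eps).sum(axis=0)
    restricted = avg_conf.copy()
    restricted[zero_links > k] = 0.0   # class i contradicts too many neighbors
    return int(restricted.argmax())
```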

3. Experiments

In this section, the experimental results of the two post-processing methods for the semantic segmentation task are presented. To make the experiments adequate, two different public datasets were selected. The first is the ADE20K semantic segmentation dataset, which contains 150 categories and is divided into 20,210, 2000 and 3000 images for training, validation and testing, respectively. ADE20K is the most popular dataset in the field of image semantic segmentation and one of the most difficult publicly available ones; it contains many categories and is very close to the scene distribution of the real world. We conduct several experiments on the logits output by three different deep learning models: UPerNet, OCRNet and SETR. UPerNet aims to parse visual concepts across scene categories, objects, parts, materials and textures; UPerNet50 achieves 40.7% mean IoU on the ADE20K validation set with single-scale inference. OCRNet adds a cross-attention module to its encoder-decoder architecture and achieves 43% mean IoU on the ADE20K validation set with single-scale inference. SETR introduces a sequence-to-sequence prediction framework as an alternative perspective on semantic segmentation; SETR-MLA achieves 48.64% mean IoU on the ADE20K validation set with single-scale inference.
The results of the ablation study on OCRNet are listed in Table 2 to illustrate the effect of, and the difference between, the first and second methods. If the average confidence of a patch is below 0.35, the model is very unconfident in the classification result; in this case, the first candidate obtained from the reference vector has a very high probability of being the correct category, which is why the first method is effective. When we apply the first method to all connected components, however, the result decreases instead. When the threshold is set between 0.35 and 0.7, the second method is more effective than the first and improves the result to 43.86%. When the threshold is between 0.7 and 1, the second method has no influence on the result, and the first method has a side effect, because the first method imposes a stronger condition than the second. Overall, we first calculate the average confidence of each connected component and then, according to this value, decide which post-processing method to use, if any.
The experimental results of three representative models are listed in Table 3. The original results of the three models all come from the implementations in [31]. The experiments in Table 2 show that the first method is more effective than the second when the average confidence is below 0.35 and has a side effect above 0.35. According to this finding, when the first method is applied to all connected components with an average confidence of less than 0.35 and all connected components with an average confidence above 0.35 are left unprocessed, the results of the three models rise to 41.11%, 43.23% and 48.83%, respectively. If we apply the first post-processing method to connected components with an average confidence of less than 0.35, apply the second method to connected components with an average confidence between 0.35 and 0.7 and leave connected components with an average confidence above 0.7 unprocessed, the results of the three models are improved to 42.07%, 44.09% and 49.09%, respectively. Obviously, the combination of the first and second methods is more effective than either method alone.
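Putting the pieces together, the following sketch applies the combination reported above: the first method below 0.35, the second method between 0.35 and 0.7, and no processing above 0.7. It assumes the illustrative helper functions sketched in Section 2 are available.

```python
# A hedged sketch of the combined post-processing, applied per connected component.
import numpy as np
from scipy import ndimage

def post_process(pred, probs, relevancy):
    out = pred.copy()
    for cls in np.unique(pred):
        components, n = ndimage.label(pred == cls)
        for comp_id in range(1, n + 1):
            mask = components == comp_id
            avg_conf = component_avg_confidence(probs, mask)
            neighbors = surrounding_classes(pred, mask)
            top = avg_conf.max()
            if top < 0.35:                    # very unconfident: first method
                out[mask] = first_method(avg_conf, neighbors, relevancy)
            elif top <= 0.7:                  # moderately confident: second method
                out[mask] = second_method(avg_conf, neighbors, relevancy)
            # above 0.7: keep the original prediction
    return out
```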
In order to further verify the effectiveness of the two post-processing methods, an additional dataset and two other models were selected for experiments. The COCO-Stuff dataset is a challenging scene-parsing dataset that contains 171 semantic classes; its training set and test set consist of 9K and 1K images, respectively. OCRNet was also applied to this dataset to compare the difficulty of the COCO-Stuff and ADE20K datasets. Meanwhile, PSPNet and DeepLabV3, two classical image semantic segmentation models, were selected for the experiment. The core contribution of PSPNet is the pyramid pooling module, which integrates context information at different scales, improves the ability to obtain global feature information and increases the expressive power of the model. DeepLabV3 proposed the atrous spatial pyramid pooling module, which captures convolutional features at different scales and encodes global image information to enhance semantic segmentation. Although both models were proposed in 2017, they are still appropriate for testing the effect of the post-processing methods.
The experimental results on the COCO-Stuff dataset are listed in Table 4. As the performance of OCRNet shows, the COCO-Stuff dataset is more challenging than the ADE20K dataset. After the combination of the two post-processing methods, the original results of the three models are improved to 38.93%, 38.95% and 40.63%, respectively. All of them are improved by more than 1%, and the overall performance gains are higher than in the previous experiment. The comparative analysis of Table 3 and Table 4 shows that the post-processing method has a more significant effect on the more difficult dataset, which is also in line with common sense.
The final result of the post-processing method proposed in this paper depends heavily on the original mIoU of the deep learning model. Several representative models were selected for the experiments; there may be models that are better suited to the post-processing method, that is, for which the gain of post-processing would be greater. Since the effectiveness of the post-processing method has already been demonstrated, this paper does not repeat the experiments on further models. Based on the analysis of the above two groups of experiments and an understanding of the principle behind the post-processing, the method should yield larger gains when processing complex or blurry images in real application scenarios. Complex images generally include many categories with strong correlations between them, so the effect of applying the post-processing will be better. Blurred images contain unclear areas caused by the shooting angle or other objective factors, and the gain of the post-processing method will also be greater for such images.
In order to visually demonstrate the effect of the post-processing method, some samples from the ADE20K dataset are compared in Figure 3. The first column shows the original prediction of the deep learning model, the second column shows the result after post-processing, and the third column shows the annotation. It is easy to see that the window, river, bed and light in the examples are corrected by the post-processing method.
Specifically, the large blue area on the window in the first image of the first row of Figure 3 is a wrong prediction of the deep learning model, which is corrected in the second image. The color of the river in the second image of the second row differs from that in the first image but is consistent with that in the third image, which requires careful observation. The three small connected areas on the bed in the first image of the third row and the light in the first image of the fourth row are obviously wrongly classified, but all of them are correctly corrected by the proposed post-processing method.

4. Conclusions

This paper presents two new post-processing methods for the semantic segmentation task. In particular, we correct the wrong classification of a central connected component inferred by a deep learning-based semantic segmentation model according to the relationships of the central connected component with its surrounding connected components. We experiment with our proposed post-processing methods on the logits of several representative models. After post-processing, the results of these models have all been improved, which shows the effectiveness of our post-processing methods.
As is well known, deep learning methods rely on large amounts of labeled data: the larger the amount of data, the better the model performs. However, labeling data is expensive, and these data-driven methods lack interpretability. The post-processing method proposed in this paper is based on common sense, so its interpretability is very strong. All in all, our work not only improves the original prediction results of deep learning models but is also persuasive in explaining why the improvement occurs. We hope that this new attempt will provide other researchers with new research perspectives.
Of course, our methods also have limitations. On datasets with few categories, the post-processing method may not have an obvious effect, because the fewer categories there are, the simpler the logical relationships between them, and the less useful this method becomes. As an extreme example, on a dataset with only two categories, this method will almost never work.

Author Contributions

J.Z. developed the conceptual idea, proposed the method and wrote the manuscript. X.B. conducted the experiments and revised the manuscript. Q.Z. reviewed the manuscript and provided insightful suggestions to further refine it. All authors contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Youth Innovation Promotion Association of the Chinese Academy of Sciences (E1291902), Jun Zhou (2021025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ADE20K dataset and the COCO-Stuff dataset used for this study can be accessed from http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip and http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/cocostuff-10k-v1.1.zip (accessed on 31 October 2022), respectively.

Conflicts of Interest

The authors declare no conflict of interest regarding the publication of this paper.

References

  1. Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  2. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  4. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  5. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  6. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  7. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Muller, J.; Manmatha, R.; et al. ResNeSt: Split-attention networks. arXiv 2020, arXiv:2004.08955. [Google Scholar]
  8. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  9. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  11. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  12. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  14. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
  15. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  16. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 603–612. [Google Scholar]
  17. Yuan, Y.; Wang, J. Ocnet: Object context network for scene parsing. arXiv 2018, arXiv:1809.00916. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  20. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  22. Bao, H.; Dong, L.; Wei, F. BEiT: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  23. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
  24. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  25. Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv 2019, arXiv:1903.11816. [Google Scholar]
  26. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  27. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 173–190. [Google Scholar]
  28. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 15–17 June 2017; pp. 633–641. [Google Scholar]
  29. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
  30. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  31. Contributors, M. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 31 October 2022).
Figure 1. An image and its annotation from the ADE20K dataset.
Figure 2. Illustration of the central connected component and its neighbors.
Figure 3. Examples of comparison before and after the post-processing method.
Table 1. Relevancy between different classes. The numbers in the table indicate the proportion of connected components labeled as one class that are adjacent to connected components labeled as another class. For example, floor is the closest class to wall: 64.49% of the connected components labeled wall are adjacent to connected components labeled floor. Lake is the farthest class from wall: no connected component labeled wall is adjacent to a connected component labeled lake.
          Wall     Floor    Bed      Pillow   Sea      Sand     Road     Car      Tree     Lake
wall      1        0.6449   0.1535   0.0529   0.0032   0.0017   0.0294   0.0331   0.0706   0
floor     0.8023   1        0.1746   0.0017   0.0005   0.0003   0.0015   0.0108   0.0042   0
bed       0.9748   0.8910   1        0.4816   0        0        0        0        0.0005   0
pillow    0.6591   0.0172   0.9452   1        0        0        0.0011   0        0        0
sea       0.0569   0.0077   0        0        1        0.3      0.0123   0.0077   0.1631   0
sand      0.0645   0.0097   0        0.0032   0.6290   1        0.0258   0.0065   0.1935   0.0032
road      0.0855   0.0035   0        0        0.0020   0.0020   1        0.6170   0.2243   0
car       0.1212   0.0320   0        0        0.0016   0.0006   0.7789   1        0.3353   0
tree      0.1226   0.0058   0.0001   0        0.0159   0.0090   0.1341   0.1588   1        0.0031
lake      0        0        0        0        0        0.0192   0        0        0.4038   1
Table 2. Ablation study on OCRNet.
Avg. Conf.     The 1st Method    The 2nd Method
[0, 0.35]      43.23             43.15
(0.35, 0.7]    42.88             43.86
(0.7, 1]       42.64             43
Table 3. Experimental results on the ADE20K dataset.
Method        Orig. mIoU    The 1st Method    The 1st + 2nd Methods
UPerNet50     40.7          41.11             42.07
OCRNet        43            43.23             44.09
SETR-MLA      48.64         48.83             49.09
Table 4. Experimental results on the COCO-Stuff dataset.
Method        Orig. mIoU    The 1st Method    The 1st + 2nd Methods
PSPNet        37.26         38.09             38.93
DeepLabV3     37.3          38.12             38.95
OCRNet        39.5          39.94             40.63
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
