Applied Sciences
  • Article
  • Open Access

20 January 2022

Semantic Segmentation Based on Depth Background Blur

1 School of Mathematics and Statistics, Sichuan University of Science and Engineering, Zigong 643000, China
2 Department of Computing Science, University of Alberta, Edmonton, AB T6G 2H1, Canada
* Author to whom correspondence should be addressed.
This article belongs to the Topic Artificial Intelligence in Sensors

Abstract

Deep convolutional neural networks (CNNs) are effective in image classification and are widely used in image segmentation tasks. Several neural networks have achieved high segmentation accuracy on existing semantic datasets, for instance, PASCAL VOC, CamVid, and Cityscapes. However, there are nearly no studies on semantic segmentation from the perspective of the dataset itself. In this paper, we analyzed the characteristics of datasets and proposed a novel experimental strategy based on bokeh to weaken the impact of futile background information. This crucial bokeh module processes each image in the inference phase by selecting an opportune fuzzy factor $\sigma$, so that the attention of our network can focus on the categories of interest. Several networks based on fully convolutional networks (FCNs) were utilized to verify the effectiveness of our method. Extensive experiments demonstrate that our approach can generally improve the segmentation results on existing datasets, such as PASCAL VOC 2012 and CamVid.

1. Introduction

In recent years, an increasing number of researchers have applied convolutional neural networks (CNNs) to pixelwise, end-to-end image segmentation tasks, e.g., semantic segmentation [,,,]. Semantic segmentation can be understood as the need to segment each object in an image and annotate it with a different color. For instance, people, displays, and aircraft in the PASCAL VOC 2012 dataset are marked in pink, blue, and red, respectively. Playing a significant role in computer vision, semantic segmentation has been widely applied in fields such as autonomous driving [], robot perception [], augmented reality [], and video surveillance [].
Since the advent of fully convolutional networks (FCNs []), the conventional approach to semantic segmentation has been greatly simplified, and various end-to-end network architectures derived from the FCN have been proposed over the years; on existing datasets, their segmentation accuracies are already quite high. The DeepLab series [,,,] proposes atrous (dilated) convolution to alleviate the limited receptive field caused by insufficient down-sampling, and its atrous spatial pyramid pooling, which carries out multi-scale feature fusion, significantly advances segmentation accuracy. Yu et al. [] proposed the bilateral segmentation network, which better preserves the spatial information of the original image while ensuring a sufficient receptive field. From semantic segmentation to real-time semantic segmentation [,,], and from redundant to lean network architectures, existing work achieves better segmentation by designing and improving the structure of the network itself and by adopting massive data augmentation. However, it ignores the impact of the characteristics of the dataset itself on the segmentation results.
Semantic segmentation, as a pixelwise classification task, requires the classification of every pixel. Nevertheless, not every pixel is of interest to us. A substantial amount of background information during the training phase not only increases the difficulty of learning but also leads to misclassifications (see Figure 1). In view of the aforementioned issues, and motivated by the excellence of the self-attention mechanism in segmentation tasks [,], we apply the attention mechanism to the dataset itself. Through an in-depth analysis of the dataset, we propose the background blur module bokeh. The overall structure is shown in Figure 2, and a feasible strategy for selecting the fuzzy factor $\sigma$ is proposed in Section 3.
Figure 1. An example in the PASCAL VOC 2012 dataset. (a) Chairs, sofas, and optical modems in the field of interest have the same color and shape as the walls and wardrobes in the background. (b) Areas of interest only occupy an unobtrusive part of the whole picture.
Figure 2. (a) Existing mainstream semantic segmentation network architecture. (b) The segmentation network architecture of this paper. (c) The bokeh module; see Section 3 for algorithm details.
Humans, as the most sophisticated creatures on Earth, have a natural advantage in pattern recognition. Relying on foveated and active vision, our visual center always focuses on the area of interest to us, rather than on the background.
The proposed bokeh module mainly performs a certain degree of background blurring according to the distribution of various categories of the dataset itself, without any prior knowledge of the domain of interest. It combines the blurred background with the domain of interest as the subsequent segmentation network input. Specifically, during the training stage, background and foreground are divided accurately by real semantic labels provided by the training set. In the validation phase, the original segmentation network is able to separate the background and foreground, based on the coarse segmentation. The reconstructed segmentation network with the bokeh module performs the final semantic segmentation. The visualization of our bokeh module is shown in Figure 3.
Figure 3. Image visualization after adding the bokeh module: (a) original image, (b) the domain of interest φ I , (c) the background blurred domain φ B , and (d) Image-bokeh, namely a combination of the latter two.
We demonstrate the effectiveness of our approach on two datasets, PASCAL VOC 2012 [] and CamVid [], and on several existing end-to-end network architectures.
The main contributions of our paper are:
  • Semantic segmentation is viewed from the perspective of the dataset itself, and an inference strategy based on the background blurring module (bokeh), which is easy to embed into existing semantic segmentation networks, is proposed.
  • According to the characteristics of each dataset, an appropriate strategy for selecting the fuzzy factor σ is proposed.
  • We verify on FCN-based networks that our bokeh module can improve the segmentation quality of the network without changing any network structure. The segmentation results of BiSeNet [] on CamVid are improved by 3.7 points, while the performance of HyperSeg [] on PASCAL VOC 2012 is improved by 5.2 points after adding the bokeh module.

3. Proposed Method

In this section, we elaborate on the bokeh algorithm, beginning with a brief description of the symbols used. Suppose the whole image is divided into an interest domain and a background domain: $\varphi_I$ denotes the collection of all pixels that we are interested in, and $\varphi_B$ denotes the collection of all pixels that we are not interested in. The distribution relationship is illustrated in Figure 4. For two matrices $A = [a_{ij}] \in \mathbb{R}^{m \times n}$ and $B = [b_{ij}] \in \mathbb{R}^{m \times n}$, the matrix $C = [c_{ij}] \in \mathbb{R}^{m \times n}$ is called the Hadamard product [] of $A$ and $B$ if $C$ satisfies the following condition:
$$c_{ij} = a_{ij} \times b_{ij} \quad (1)$$
Figure 4. The distribution map of the interest domain $\varphi_I$ and background domain $\varphi_B$.
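The Hadamard product in (1) is simply element-wise multiplication; a minimal NumPy illustration (the example matrices are arbitrary):

```python
import numpy as np

# Hadamard (element-wise) product: c_ij = a_ij * b_ij
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.5, 0.5], [1.0, 0.0]])
C = A * B          # equivalent to np.multiply(A, B)
print(C)           # [[0.5 1. ]
                   #  [3.  0. ]]
```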
We found that, on some of the datasets used for semantic segmentation (e.g., PASCAL VOC 2012), there is an imbalance between different categories, as well as an imbalance between the interest and background domains. For example, both Bicycle (0.29%) and Person (4.57%) are categories of interest, yet the former occupies only about one sixteenth as many pixels as the latter. Moreover, the ratio of the interest domain to the background domain is about 1:3, as shown in Table 2. This is not favorable for the segmentation task: a considerable amount of background information either increases the difficulty of training or has no effect on the improvement of segmentation accuracy, and some areas of the background domain may resemble areas of the interest domain, so learning more background information weakens the role of the usable information. We are therefore inclined to play down the impact of background information on learning, so that the network learns more useful information and the segmentation accuracy of all categories improves. We use the effective labeling information in the existing labels to obtain the interest domain of the training set.
Table 2. Proportion of 21 categories (with background) on the PASCAL VOC 2012 [] training dataset.
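For reference, proportions like those in Table 2 can be obtained by counting label pixels over the training annotations. A minimal sketch, assuming VOC-style single-channel label PNGs in one directory (the directory layout and the use of PIL are our assumptions, not part of the original pipeline):

```python
import numpy as np
from glob import glob
from PIL import Image

def class_proportions(label_dir, num_classes=21, void_label=255):
    """Count per-class pixel proportions over VOC-style label maps (assumed layout)."""
    counts = np.zeros(num_classes, dtype=np.int64)
    void = 0
    for path in glob(f"{label_dir}/*.png"):
        gt = np.array(Image.open(path))          # H x W, values in {0..20, 255}
        void += int((gt == void_label).sum())
        for c in range(num_classes):
            counts[c] += int((gt == c).sum())
    total = counts.sum() + void
    return counts / total                        # index 0 is the background class

# Example (hypothetical path):
# print(class_proportions("VOC2012/SegmentationClass"))
```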

3.1. The Algorithm of Bokeh

For any input image and the corresponding label in the training stage, $img \in \mathbb{R}^{H \times W \times C}$, $GT \in \mathbb{R}^{H \times W}$, and $\varphi \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ is the size of the image and $C$ represents the number of image channels (for RGB, $C = 3$), the background blur module bokeh can be summarized as follows:
$$\varphi(img, GT, \sigma) = \varphi_I(img, GT, 1) + \varphi_B(img, GT, \sigma) \quad (2)$$
$$\varphi_I(img, GT, 1) = img(i, j, k) \times [I(i, j) - BGL(i, j)] \times 1 \quad (3)$$
$$\varphi_B(img, GT, \sigma) = img(i, j, k) \times BGL(i, j) \times \sigma \quad (4)$$
where
$$0 \le i \le H - 1, \quad 0 \le j \le W - 1, \quad k = 1, 2, \ldots, C, \quad \sigma \in (0, 1]$$
We denote $\varphi_I(img, GT, 1)$ as the interest domain and $\varphi_B(img, GT, \sigma)$ as the background blur domain. When $BGL(i, j) = 1$, the pixel at $(i, j)$ belongs to the background; otherwise, it belongs to the interest domain. $I$ is an $H$-by-$W$ matrix of ones, "$*$" is the matrix Hadamard product operator, and $\sigma$ denotes the fuzzy factor, whose value is inversely proportional to the degree of blur; its selection strategy will be given later. The background label variable $BGL(i, j)$ can be presented as:
$$BGL(i, j) = \begin{cases} 1, & GT(i, j) = 0 \\ 0, & \text{otherwise} \end{cases} \quad (5)$$
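A minimal NumPy sketch of (2)-(5), assuming the image is a float array and the label map uses 0 for background (the function and variable names are ours, not the paper's):

```python
import numpy as np

def bokeh(img, gt, sigma):
    """Compose phi = phi_I + phi_B as in (2)-(5).
    img: H x W x C float array; gt: H x W label map (0 = background); sigma in (0, 1]."""
    bgl = (gt == 0).astype(img.dtype)   # BGL(i, j) = 1 on background pixels, eq. (5)
    bgl = bgl[..., None]                # broadcast the mask over the C channels
    phi_i = img * (1.0 - bgl)           # interest domain, eq. (3)
    phi_b = img * bgl * sigma           # attenuated background domain, eq. (4)
    return phi_i + phi_b                # eq. (2)
```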
Assume that $R_B$ is the proportion of the background field in an image and $R_I$ is the proportion of the field of interest. Obviously, $R_B + R_I = 1$, where $R_B$ and $R_I$ are defined as follows:
$$R_B = \frac{Num(\varphi_B)}{H \times W}, \quad R_I = \frac{Num(\varphi_I)}{H \times W} \quad (6)$$
where $Num(\varphi_B)$ is the number of pixels in the background domain and $Num(\varphi_I)$ is the number of pixels in the interest domain. For the PASCAL VOC 2012 training dataset, $Num(\varphi_B) = Num(pixel = 0) + Num(pixel = 255)$ and $Num(\varphi_I) = \sum_{i=1}^{20} Num(pixel = i)$.
For the selection of the fuzzy factor $\sigma$, we initially set $R_B^*$ equal to the background rate of the whole dataset (e.g., for the PASCAL VOC training dataset, $R_B^* = 0.7481$) and suppose $\sigma \in [1 - R_B^*, 1]$. The degree of background blur of each image depends on the distribution of its own background: if its background proportion is larger, the background blurring should be stronger, so $\sigma$ should be smaller; conversely, $\sigma$ should be greater. Specifically, when $R_B = 0$, only the field of interest is involved and the corresponding fuzzy factor $\sigma$ should be maximized; when $R_B = 1$, the field of interest is barely included and $\sigma$ should be minimized. Let $R_B$ and $\sigma$ satisfy the linear relation
$$\sigma = \alpha \times R_B + \beta \quad (7)$$
such that
$$R_B = 0 \Rightarrow \sigma = 1, \qquad R_B = 1 \Rightarrow \sigma = 1 - R_B^*$$
where $\alpha, \beta \in \mathbb{R}$.
Solving (7), we obtain $\alpha = -R_B^*$ and $\beta = 1$; thus, $\sigma$ can be recast as
$$\sigma = -R_B^* \times R_B + 1 \quad (8)$$
Substituting (3), (4), and (8) into (2), we obtain the formula for the evaluation of bokeh:
$$\varphi(img, GT, \sigma) = img(:, :, k) \times [I(:, :) - R_B^* \times R_B \times BGL(:, :)] \quad (9)$$
where $BGL(:, :)$ is defined in (5).
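With the linear rule (8), the whole module collapses into the single Hadamard product (9). A sketch under the same assumptions as before, counting the void label 255 as background when computing $R_B$, as described above for PASCAL VOC:

```python
import numpy as np

def bokeh_full(img, gt, rb_star=0.7481, void_label=255):
    """Per-image fuzzy factor (8) and the closed form (9).
    img: H x W x C float array; gt: H x W label map (0 = background, 255 = void)."""
    bgl = (gt == 0).astype(img.dtype)                    # BGL as in (5)
    r_b = ((gt == 0) | (gt == void_label)).mean()        # background rate R_B, eq. (6)
    sigma = 1.0 - rb_star * r_b                          # eq. (8): sigma = -R_B* x R_B + 1
    phi = img * (1.0 - rb_star * r_b * bgl[..., None])   # eq. (9)
    return phi, sigma
```

In the training stage gt is the ground truth; in the validation stage it is the coarse prediction CP produced by the original network, as described in Section 3.3.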

3.2. The Main Mechanism of Bokeh

The reason why CNNs can accomplish various classification tasks is that, after a series of convolution and pooling operations, networks are able to infer an abstract representation (also called a high-level feature map) of the input image. The quality of this abstract representation depends not only on the performance of the network but also on the characteristics of the input image. Finding differences between similar objects is much more difficult than finding differences between dissimilar objects. For example, suppose we use the same binary classification network to distinguish apples from bananas, or tomatoes from cherry tomatoes; the latter is obviously more difficult, precisely because the similarities weaken the differences between the two objects.
Since semantic segmentation often requires a large amount of sample data, similarities between categories inevitably exist. The proposed bokeh method therefore amplifies the differences between categories. Assume that there are two similar objects in an image, $A_b$ and $A_o$, where $A_b$ is marked as background and $A_o$ is marked as category $I$ in the GT. In a training iteration of the network model, the abstract representation learned from $A_b$ is denoted as $F_{A_b}$, and the abstract representation learned from $A_o$ is denoted as $F_{A_o}$. Obviously, $F_{A_b}$ and $F_{A_o}$ share some representation elements. The segmentation network learns similar high-level features from two objects of different categories, so it is hard for the network to figure out the features of the category of interest $A_o$. After the bokeh blurring operation is imposed on $A_b$, the similarity between $F_{A_b}$ and $F_{A_o}$ is abated during iterations, and the network can gradually learn the features of $A_o$ properly.

3.3. Embed into an Existing Network

According to (8), the selection of the fuzzy factor $\sigma$ is determined solely by the background rates ($R_B^*$ and $R_B$). Therefore, for different datasets, we only need to calculate the proportion of each category before using the bokeh method, so it can be used as a general method. Taking the FCN network as an example, the overall network architecture after adding our bokeh module is shown in Figure 5. For each block, the convolution stride is one and the pooling stride is two, and there are two more convolution operations with stride 1 after "Conv5 + pool". After each downsampling, the feature map size becomes half of the original. The decoder consists of outputs of three different structures. First, FCN-32s (Output_1) is obtained from the result of "Conv5 + pool" through 32× upsampling. The output of "Conv5 + pool" with 2× upsampling is added to the output of "Conv4 + pool" to obtain Add_1; then, FCN-16s (Output_2) is acquired from Add_1 through 16× upsampling. Similarly, as shown in Figure 5, we can obtain FCN-8s (Output_3). The detailed structure is shown in Table 3. The network backbone can use AlexNet [], VGGNet [], or GoogLeNet [].
Figure 5. The FCN [] network architecture after adding bokeh is illustrated in the figure above. The entire network can be subdivided into bokeh, encoder and decoder. Bokeh is described in Section 3. The encoder is composed of 5 “ConvX + pool” downsampling blocks. The value of “*” is GT (ground truth) in the training stage and CP (coarse prediction) in the verification stage.
Table 3. Detailed structure of the network. Backbone: VGG16, Input size: 512 × 512 , C: the number of object classes.
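To make the inference strategy of Figure 5 concrete, the two-pass procedure (coarse prediction, bokeh, final prediction) could be sketched as follows; seg_net is a placeholder for any trained segmentation network, and the array conventions are our assumptions:

```python
import numpy as np

def infer_with_bokeh(seg_net, img, rb_star=0.7481, void_label=255):
    """Two-pass inference: coarse prediction -> bokeh -> final prediction.
    seg_net: any callable mapping an H x W x C image to an H x W label map (placeholder)."""
    coarse = seg_net(img)                                    # coarse prediction CP (used in place of GT)
    bgl = (coarse == 0).astype(img.dtype)                    # background mask from the coarse prediction
    r_b = ((coarse == 0) | (coarse == void_label)).mean()    # per-image background rate R_B
    blurred = img * (1.0 - rb_star * r_b * bgl[..., None])   # eq. (9)
    return seg_net(blurred)                                  # final semantic segmentation
```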

4. Experimental Results

FCN [], BiseNet [], and HyperSeg [] are selected as our segmentation networks, and relevant experiments are carried out on PASCAL VOC 2012 [] and CamVid [] benchmarks. A brief review of the corresponding datasets and metrics will be first presented. Following this, implementation and explanation of experiments will be given.
PASCAL VOC 2012: As one of the most widely used benchmarks for semantic segmentation, PASCAL VOC 2012 covers not only indoor and outdoor scenes but also night-time scenes, with a total of 21 semantic categories (20 categories of interest and a background class). The whole dataset contains 4369 images, 1464 of which are used for training, 1449 for validation, and 1456 for testing. The training and validation sets are fully annotated, while the test set does not provide labels. The training set was later expanded by SBD [] to 10,582 samples.
CamVid: As a small-scale urban street view dataset, CamVid includes a total of 701 fully annotated images, 367 of which are employed to train, 101 for validation and 233 for testing. The CamVid dataset consists of 11 semantic categories (e.g., cars, buildings, billboards, etc.) and an Unlabelled class. Each image has the same resolution: 720 × 960 .
Metrics: Let $n_{ij}$ be the number of pixels of class $i$ predicted to be class $j$, and let $C$ be the number of object classes (including the background class). We compute four indices, Pixel Acc, Mean Acc, Mean IOU, and F.W IOU, as defined below; naturally, the higher the values, the better the network performance.
  • Pixel accuracy (Pixel Acc): $\frac{\sum_{i=1}^{C} n_{ii}}{\sum_{i=1}^{C}\sum_{j=1}^{C} n_{ij}}$;
  • Mean pixel accuracy (Mean Acc): $\frac{1}{C}\sum_{i=1}^{C} \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij}}$;
  • Mean intersection over union (Mean IOU): $\frac{1}{C}\sum_{i=1}^{C} \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij} + \sum_{j=1}^{C} n_{ji} - n_{ii}}$;
  • Frequency weighted intersection over union (F.W IOU): $\frac{1}{\sum_{i=1}^{C}\sum_{j=1}^{C} n_{ij}} \sum_{i=1}^{C} \frac{\left(\sum_{j=1}^{C} n_{ij}\right) n_{ii}}{\sum_{j=1}^{C} n_{ij} + \sum_{j=1}^{C} n_{ji} - n_{ii}}$.
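A compact NumPy sketch of these four indices computed from a C×C confusion matrix (the function name and the confusion-matrix convention are ours):

```python
import numpy as np

def segmentation_metrics(n):
    """n: C x C confusion matrix with n[i, j] = pixels of class i predicted as class j.
    Assumes every class appears at least once in the ground truth."""
    n = n.astype(np.float64)
    tp = np.diag(n)                        # n_ii
    gt_pixels = n.sum(axis=1)              # sum_j n_ij
    pred_pixels = n.sum(axis=0)            # sum_j n_ji
    iou = tp / (gt_pixels + pred_pixels - tp)
    pixel_acc = tp.sum() / n.sum()
    mean_acc = (tp / gt_pixels).mean()
    mean_iou = iou.mean()
    fw_iou = (gt_pixels * iou).sum() / n.sum()
    return pixel_acc, mean_acc, mean_iou, fw_iou
```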

4.1. Implementation Protocol

We reconstruct the classical FCN [], BiSeNet [], and HyperSeg [] networks. In order to evaluate the impact of background information on segmentation accuracy more objectively, we remove all data augmentation (except for crop size) used in the original papers. The reconstructed networks are denoted by (Re)FCN-8s, (Re)FCN-16s, (Re)FCN-32s, (Re)HyperSeg, and (Re)BiSeNet, respectively. Our reconstructed results are slightly lower than the originally reported ones because we did not apply heavy data augmentation. However, our focus is to demonstrate the feasibility of our method, rather than to narrow the gap with the original papers.
Training details: For the CamVid [] dataset, an Adam optimizer was used, with batch size 8, initial learning rate $1 \times 10^{-4}$, and weight decay $1 \times 10^{-4}$ in training. Similar to the DeepLab series [,,], the "poly" learning rate decay strategy was adopted: after each iteration, the initial learning rate is multiplied by $(1 - \frac{iter}{iter_{max}})^{power}$, where $power = 0.9$. For the PASCAL VOC 2012 [] dataset, training used batch size 12 and weight decay $2 \times 10^{-4}$; after every 50 epochs, the learning rate decayed to half of its previous value.
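A minimal sketch of the "poly" schedule used here (the optimizer hook in the comment is a placeholder, not part of the original code):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' decay: scale the initial learning rate by (1 - iter / iter_max)^power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Example with the CamVid settings above (iteration counts are placeholders):
# for it in range(max_iter):
#     optimizer.param_groups[0]["lr"] = poly_lr(1e-4, it, max_iter)
```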
Data augmentation: No additional operations are required except cropping. For CamVid, the images processed by SegNet [], with resolution 360 × 480, are used as input in this paper. The PASCAL VOC 2012 images are cropped to a fixed size as input.

4.2. Ablation for Bokeh

We applied bokeh to multiple segmentation networks on two datasets and conducted comparative experiments. The experimental results of the three FCN network architectures and HyperSeg on the PASCAL VOC dataset are shown in Table 4. As can be seen, the mean IOU of the four mentioned network architectures ((Re)FCN-32s, (Re)FCN-16s, (Re)FCN-8s, and (Re)HyperSeg) with bokeh is improved by 4.7, 4.6, 4.8, and 5.2 points, respectively. At the same time, the per-category accuracy of FCN-8s before and after adding the bokeh module on the PASCAL VOC 2012 val set is given in Table 5. Note that the segmentation accuracy of individual categories is significantly improved in all but 5 of the 84 comparison items.
Table 4. The segmentation performance of three variants of FCN on the PASCAL VOC 2012 validation set, “(Re) FCN-XXs + bokeh” means the bokeh module is added to “(Re)FCN-XXs”.
Table 5. Comparison of detailed accuracy of FCN-8s before and after adding bokeh module on the PASCAL VOC 2012 Val dataset.
Considering the results in Table 2 and Table 5 together, we find that relatively small categories, such as bicycle (0.29%), boat (0.58%), and potted plant (0.64%), make a significant contribution to the accuracy improvement. Note that the segmentation accuracy of a few exceptional categories, such as "Cow" and "Dining Table", decreases instead. This is because blurring the background also suppresses context information to some extent. We addressed this from two aspects: one is to employ the fusion of "(Re)FCN-8s" and "(Re)FCN-8s + bokeh", named "(Re)FCN-8s + Fusion"; the other is to confine the scope of the background blur field so that rich context information is preserved, named "(Re)FCN-8s + Shrink". The experimental results in Table 5 demonstrate that these two methods can avoid the accuracy decrease in specific categories; however, the improvement in overall segmentation accuracy is not as large as before. As a result, we accept the accuracy decrease in a few individual categories. Qualitative examples on this dataset are shown in Figure 6.
Figure 6. Examples of the output before and after adding the bokeh module on the PASCAL VOC 2012 dataset, our resulting contour is much smoother. The first three rows are the results of experiments on FCN, and the last four rows are the results of experiments on Hyperseg. (a) Image; (b) (Re)Seg.Network; (c) (Re)Seg.Network+bokeh; (d) Ground Truth.
The bokeh module improves the segmentation results on the CamVid val set by 3.7 points in Mean IOU, as shown in Table 6. This demonstrates that the proposed bokeh module is easily embedded into a real-time network architecture. In view of the results on PASCAL VOC 2012, bokeh plays a vital role for classes with a small proportion in the dataset. As for CamVid, the background occupies a relatively small proportion, but the accuracy increase is still clear. The per-category proportion of the CamVid training dataset is presented in Table 7.
Table 6. Accuracy result on the CamVid val dataset. After adding the bokeh module, the result is improved by 3.7 points.
Table 7. The category proportion of the CamVid training dataset.
It is clear from Table 8 that the improvement in the Unlabelled category is the most significant. Analyzing the characteristics of CamVid, this can be attributed to the low background proportion of the dataset itself and the diversity of categories in each image. After adding the bokeh module, the changes in the image are not noticeable compared to the one before fuzzification; however, this slight improvement occurs in almost all of the Unlabelled category. Qualitative examples on this dataset are given in Figure 7.
Table 8. Accuracy result on the CamVid val dataset.
Figure 7. Examples of the output before and after adding the bokeh module on the CamVid dataset. It is obvious that the network is sensitive enough to recognize the background (the black part in picture (b,c)) after adding the bokeh module. The first row is the results of experiments on BiSeNet and the second row on HyperSeg. (a) image; (b) (Re)Seg.Network; (c) (Re)Seg.Network+bokeh; (d) ground truth.
We compared the advanced feature maps of 20 channels (excluding background channels) in the last layer of HyperSeg [] before and after adding bokeh. The sensitivity of the network to the categories of interest is higher after adding bokeh, as shown in Figure 8.
Figure 8. Comparison of advanced feature maps and prediction: (a) feature map and prediction of the HyperSeg network before adding bokeh, the last row is the input image. (b) Feature map and prediction of the HyperSeg network after adding bokeh, the last row is the ground truth. The feature map does not include background channels.

5. Conclusions

In this paper, we propose a semantic segmentation method based on background blurring, which adaptively processes the input image background via the fuzzy factor $\sigma$, without changing the original network structure or introducing additional parameters, to expand the differences between background and foreground and guide the network segmentation. The selection of $\sigma$ is determined by the overall background rate $R_B^*$ of the dataset and the background rate $R_B$ of the current image: the former determines the approximate range of its value, while the latter determines its specific value. Compared with the attention mechanism in the network layers, bokeh plays the same role on the dataset, weakening the background information to highlight the features of the foreground. Moreover, our approach can be easily embedded into existing segmentation networks. As our experiments show, our method achieves competitive performance on PASCAL VOC 2012 and CamVid, with mean IOU increased by 5.2 and 3.7 points, respectively, especially for categories with a small proportion in the dataset. The main limitation of this study is that our bokeh method relies on an existing segmentation network, whose performance directly determines whether we can accurately trace the background; different choices of segmentation network may therefore yield different results. A natural progression of this work is thus to segment the foreground and background efficiently without relying on the current network. In addition, incorporating classical image processing methods and effectively encoding and decoding contour information will be the focus of our future work.

Author Contributions

Writing—original draft, H.L.; Writing—review and editing, C.L. and A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by Sichuan Science and Technology Program grant number 2019YJ0541, the Open Project of the Key Lab of Enterprise Informationization and Internet of Things of Sichuan Province grant number 2021WZY02, the Open Fund Project of Artificial Intelligence Key Laboratory of Sichuan Province grant number 2018RYJ02, Postgraduate course construction project of Sichuan University of Science and Engineering grant number YZ202103 and Graduate innovation fund of Sichuan University of Science and Engineering grant number Y2021099.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Shi, H.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 603–612.
  2. Vu, T.-H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2512–2521.
  3. Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. arXiv 2020, arXiv:1909.11065.
  4. Li, Y.; Yuan, L.; Vasconcelos, N. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6929–6938.
  5. Cortinhal, T.; Tzelepis, G.; Aksoy, E. SalsaNext: Fast Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving. arXiv 2020, arXiv:2003.03653.
  6. Zhang, X.; Chen, Z.; Wu, Q.M.J.; Cai, L.; Lu, D.; Li, X. Fast semantic segmentation for scene perception. IEEE Trans. Ind. Inf. 2019, 15, 1183–1192.
  7. Ko, T.; Lee, S. Novel Method of Semantic Segmentation Applicable to Augmented Reality. Sensors 2020, 20, 1737.
  8. Lin, C.; Yan, B.; Tan, W. Foreground detection in surveillance video with fully convolutional semantic network. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4118–4122.
  9. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
  10. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Semantic image segmentation with deep convolutional nets and fully connected CRFs. J. Comput. Sci. 2014, 4, 357–361.
  11. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
  12. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
  13. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018, arXiv:1802.02611.
  14. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–349.
  15. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147.
  16. Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9522–9531.
  17. Lin, G.; Shen, C.; Hengel, A.V.; Reid, I. Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3194–3203.
  18. Lin, Z.; Feng, M.; Santos, C.D.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A Structured Self-attentive Sentence Embedding. arXiv 2017, arXiv:1703.03130.
  19. Everingham, M.; van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
  20. Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 2009, 30, 88–97.
  21. Nirkin, Y.; Wolf, L.; Hassner, T. HyperSeg: Patch-wise Hypernetwork for Real-time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4061–4070.
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  23. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1–10.
  24. Ignatov, A.D.; Patel, J.; Timofte, R. Rendering Natural Camera Bokeh Effect with Deep Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 1676–1686.
  25. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890.
  26. Horn, R.; Johnson, J. Matrix Analysis; Cambridge University Press: Cambridge, UK, 2012.
  27. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
  28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  30. Hariharan, B.; Arbelaez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 991–998.
  31. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
