DSA: Deformable Segmentation Attention for Multi-Scale Fisheye Image Segmentation
Abstract
1. Introduction
- In semantic segmentation of fisheye images, we apply deformable attention to strike a balance between accuracy and speed. The deformable mechanism strengthens local feature extraction and the attention mechanism strengthens global context modeling; together, they improve the model’s capacity to handle the nonlinear distortions of fisheye imagery while retaining performance.
- Approaching fisheye image segmentation from a multi-scale perspective, we propose the DSA module, which combines deformable attention with a spatial pyramid to capture features at multiple scales and improve the model’s ability to segment small and overlapping categories.
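To make the deformable mechanism concrete, the sketch below shows the core idea in PyTorch: a small convolutional head predicts a per-pixel offset field, and the feature map is bilinearly resampled at the shifted locations, letting the network bend its sampling grid around fisheye distortion. This is a minimal illustrative sketch under our own assumptions (the `DeformableSampling` name, the 3 × 3 offset head, and the tanh bounding of offsets), not the implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Minimal sketch of deformable sampling: predict per-pixel offsets,
    then bilinearly resample the feature map at the shifted locations."""

    def __init__(self, channels: int):
        super().__init__()
        # Offset head: predicts a (dx, dy) displacement for every pixel.
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Regular sampling grid in [-1, 1], the convention grid_sample expects.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
        # Predicted offsets deform the grid; tanh keeps the shifts bounded.
        offsets = torch.tanh(self.offset_head(x)).permute(0, 2, 3, 1)
        return F.grid_sample(x, base_grid + offsets, align_corners=True)
```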
2. Materials and Methods
2.1. Recap of MSCAN
2.2. Brief Description of SegNeXt’s Decoder
2.3. Brief Description of Deformable Attention
2.4. Deformable Segmentation Attention Module
- (1) A 1 × 1 convolutional layer and three deformable attention modules with different offset settings (3, 5, and 7, respectively).
- (2) A global average pooling layer that captures global context information; its output is fed into a 1 × 1 convolution layer to associate pixels with the global distribution region and is finally restored to the input resolution by bilinear interpolation.
- (3) The features obtained from the first two steps are concatenated along the channel dimension and fed into a 1 × 1 convolution that adjusts the channel count to produce the fused features (a code sketch of this three-step fusion follows below).
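To show how the three steps above could fit together, here is a hedged PyTorch sketch of the fusion. The deformable attention branches are abstracted behind a `make_branch` factory because their internals are not reproduced here; `DSAModule`, `make_branch`, treating the 1 × 1 convolution as a parallel branch, and the channel sizes are all illustrative assumptions rather than the authors’ code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSAModule(nn.Module):
    """Sketch of the three-step DSA fusion: (1) a 1x1 conv plus deformable
    attention branches at several offsets, (2) a global-average-pooling
    context branch, (3) channel concatenation and a fusing 1x1 conv."""

    def __init__(self, in_ch: int, out_ch: int, make_branch, offsets=(3, 5, 7)):
        super().__init__()
        # Step (1): 1x1 conv and one deformable-attention branch per offset.
        self.proj = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        self.branches = nn.ModuleList([make_branch(in_ch, k) for k in offsets])
        # Step (2): global context branch with its own 1x1 conv.
        self.gap_conv = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        # Step (3): 1x1 conv that fuses the concatenated features.
        n_feats = 2 + len(offsets)  # proj + offset branches + global branch
        self.fuse = nn.Conv2d(n_feats * in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [self.proj(x)] + [branch(x) for branch in self.branches]
        # Global context: pool to 1x1, embed, then restore size bilinearly.
        g = F.adaptive_avg_pool2d(x, 1)
        g = F.interpolate(self.gap_conv(g), size=(h, w), mode="bilinear",
                          align_corners=False)
        feats.append(g)
        # Fuse along the channel dimension and adjust the channel count.
        return self.fuse(torch.cat(feats, dim=1))
```

As a quick shape check, a dilated 3 × 3 convolution can stand in for each deformable attention branch, which also highlights the module’s kinship with ASPP-style pyramids:

```python
dsa = DSAModule(64, 64, lambda c, k: nn.Conv2d(c, c, 3, padding=k, dilation=k))
out = dsa(torch.randn(2, 64, 32, 32))  # -> torch.Size([2, 64, 32, 32])
```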
3. Experiment
3.1. Dataset and Implementations
3.2. Comparison with State-of-the-Art Methods
3.3. Ablation Studies
Contributions of the Proposed Method
3.4. Choice of K-Values in Deformable Attention Mechanism
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Choi, S.; Kim, J.T.; Choo, J. Cars Can’t Fly Up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9370–9380.
- Komatsu, R.; Fujii, H.; Tamura, Y.; Yamashita, A.; Asama, H. 360° Depth Estimation from Multiple Fisheye Images with Origami Crown Representation of Icosahedron. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10092–10099.
- Li, D.; Ma, N.; Gao, Y. Future vehicles: Learnable wheeled robots. Sci. China Ser. F Inf. Sci. 2020, 63, 193201.
- Yang, Q.; You, L.; Zhao, L. Study of fisheye image correction algorithm based on improved spherical projection model. Chin. J. Electron. Devices 2019, 42, 449–452.
- Ma, H.; Zhu, L.; Zeng, J. Fisheye image distortion correction algorithm based on mapping adaptive convolution and isometric projection. Mod. Comput. 2021, 51–56.
- Deng, L.; Yang, M.; Li, H.; Li, T.; Hu, B.; Wang, C. Restricted Deformable Convolution-Based Road Scene Semantic Segmentation Using Surround View Cameras. IEEE Trans. Intell. Transp. Syst. 2020, 21, 4350–4362.
- Playout, C.; Ahmad, O.; Lecue, F.; Cheriet, F. Adaptable Deformable Convolutions for Semantic Segmentation of Fisheye Images in Autonomous Driving Systems. arXiv 2021, arXiv:2102.10191.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Ramachandran, S.; Sistu, G.; McDonald, J.; Yogamani, S. Woodscape Fisheye Semantic Segmentation for Autonomous Driving—CVPR 2021 OmniCV Workshop Challenge. arXiv 2021, arXiv:2107.08246.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; pp. 205–218.
- Yogamani, S.; Hughes, C.; Horgan, J.; Sistu, G.; Chennupati, S.; Uricar, M.; Milz, S.; Simon, M.; Amende, K.; Witt, C.; et al. WoodScape: A Multi-Task, Multi-Camera Fisheye Dataset for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9307–9317.
- Yang, K.; Zhang, J.; Reis, S.; Hu, X.; Stiefelhagen, R. Capturing Omni-Range Context for Omnidirectional Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 19–25 June 2021; pp. 1376–1386.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
- Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. arXiv 2022, arXiv:2209.08575.
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Volume 30, pp. 6000–6010.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Volume 34, pp. 12077–12090.
- Geng, Z.; Guo, M.H.; Chen, H.; Li, X.; Wei, K.; Lin, Z. Is Attention Better Than Matrix Decomposition? arXiv 2021, arXiv:2109.04553.
- Chen, Z.; Zhu, Y.; Zhao, C.; Hu, G.; Zeng, W.; Wang, J.; Tang, M. DPT: Deformable Patch-Based Transformer for Visual Recognition. In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), New York, NY, USA, 20–24 October 2021; pp. 2899–2907.
- Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer With Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803.
- Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection With Dynamic Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2988–2997.
- Ma, N.; Li, D.; He, W.; Deng, Y.; Li, J.; Gao, Y.; Bao, H.; Zhang, H.; Xu, X.; Liu, Y.; et al. Future vehicles: Interactive wheeled robots. Sci. China Inf. Sci. 2021, 64, 208–210.
- Hendrycks, D.; Gimpel, K. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. 2017. Available online: https://openreview.net/forum?id=Bk0MRI5lg (accessed on 24 September 2023).
- Sekkat, A.R.; Dupuis, Y.; Kumar, V.R.; Rashed, H.; Yogamani, S.; Vasseur, P.; Honeine, P. SynWoodScape: Synthetic Surround-View Fisheye Camera Dataset for Autonomous Driving. IEEE Robot. Autom. Lett. 2022, 7, 8502–8509.
- Bjorck, J.; Weinberger, K.; Gomes, C. Understanding Decoupled and Early Weight Decay. arXiv 2020, arXiv:2012.13841.
- Sekkat, A.R.; Dupuis, Y.; Honeine, P.; Vasseur, P. A comparative study of semantic segmentation of omnidirectional images from a motorcycle perspective. Sci. Rep. 2022, 12, 4968.
- Romera, E.; Álvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2018, 19, 263–272.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
- Pohlen, T.; Hermans, A.; Mathias, M.; Leibe, B. Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3309–3318.
- Jegou, S.; Drozdzal, M.; Vazquez, D.; Romero, A.; Bengio, Y. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Los Alamitos, CA, USA, 21–26 July 2017; pp. 1175–1183.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 833–851.
- Zhang, E.; Xie, R.; Bian, Y.; Wang, J.; Tao, P.; Zhang, H.; Jiang, S. Cervical cell nuclei segmentation based on GC-UNet. Heliyon 2023, 9, e17647.
- Jiang, S.; Cui, R.; Wei, R.; Fu, Z.; Hong, Z.; Feng, G. Tracking by segmentation with future motion estimation applied to person-following robots. Front. Neurorobot. 2023, 17, 1255085.
- Jiang, S.; Hong, Z. Unexpected Dynamic Obstacle Monocular Detection in the Driver View. IEEE Intell. Transp. Syst. Mag. 2023, 15, 68–81.
Method | FPS | MAP (%) | MIoU (%) |
---|---|---|---|
ERFNet | 33.33 | - | 43.3 |
PSPnet | - | 67.00 | 50.00 |
Swin-Unet(L) | 0.82 | - | 44.03 |
ECANet | - | - | 60.2 |
FRRN | 2.86 | 73.11 | 22.18 |
FC-DenseNet103 | 1.26 | 67.5 | 31.67 |
DeepLabv3+ | 9.71 | 73.22 | 60.69 |
DeepLabv3+ & DSA (ours) | 9.17 | 74.52 | 61.44 |
SegNeXt | 9.18 | 73.6 | 64.11 |
SegNeXt-HSA | 8.93 | 72.62 | 63.0 |
SegNeXt-DSA (ours) | 8 | 74.5 | 65.33 |
Class | DeepLabv3+ | SegNeXt | SegNeXt-DSA |
---|---|---|---|
Background | 96.46 | 96.93 | 97.07 |
Road | 92.22 | 93.04 | 92.94 |
Lanemarks | 65.18 | 66.44 | 66.77 |
Curb | 51.28 | 49.88 | 50.63 |
Person | 53.12 | 63.41 | 61.38 |
Rider | 44.55 | 51.51 | 52.91 |
Vehicles | 77.45 | 81.54 | 82.28 |
Bicycle | 45.87 | 47.13 | 51.78 |
Motorcycle | 47.49 | 50.04 | 53.16 |
Traffic_sign | 40.76 | 42.98 | 44.35 |
Method | MAP (%) | MIoU (%) |
---|---|---|
SegNeXt | 73.6 | 64.11 |
SegNeXt-DAT | 72.4 | 63.59 |
SegNeXt-ASPP | 73.34 | 64.61 |
SegNeXt-DSA | 74.5 | 65.33 |
Category | [3, 5, 7] | [3, 7, 9] | [3, 5, 9] | [5, 7, 9] |
---|---|---|---|---|
Background | 97.07 | 96.84 | 96.94 | 96.90 |
Road | 92.94 | 92.41 | 92.64 | 92.54 |
Lanemarks | 66.77 | 66.22 | 66.46 | 66.51 |
Curb | 50.63 | 49.84 | 49.86 | 49.98 |
Person | 61.38 | 61.43 | 61.70 | 61.68 |
Rider | 52.91 | 52.40 | 53.06 | 52.75 |
Vehicles | 82.28 | 81.97 | 82.02 | 82.03 |
Bicycle | 51.78 | 51.25 | 50.60 | 51.07 |
Motorcycle | 53.16 | 53.82 | 54.09 | 53.99 |
Traffic_sign | 44.35 | 42.49 | 43.18 | 43.48 |
Category | [3, 5, 7] | [3, 7, 9] | [3, 5, 9] | [5, 7, 9] |
---|---|---|---|---|
Background | 98.86 | 98.8 | 98.87 | 98.79 |
Road | 96.27 | 95.93 | 95.93 | 96.06 |
Lanemarks | 74.28 | 73.25 | 74.17 | 74.01 |
Curb | 63.11 | 62.55 | 61.71 | 62.05 |
Person | 71.48 | 71.3 | 72.44 | 72.65 |
Rider | 66.57 | 63.6 | 66.69 | 65.26 |
Vehicles | 89.62 | 89.42 | 89.56 | 89.46 |
Bicycle | 66.15 | 65.9 | 63.48 | 65.17 |
Motorcycle | 65.9 | 65.53 | 67.83 | 66.68 |
Traffic_sign | 52.83 | 49.59 | 50.37 | 51.36 |