Boosting Semantic Segmentation by Conditioning the Backbone with Semantic Boundaries

In this paper, we propose the Semantic-Boundary-Conditioned Backbone (SBCB) framework, an effective approach to enhancing semantic segmentation performance, particularly around mask boundaries, while maintaining compatibility with various segmentation architectures. Our objective is to improve existing models by leveraging semantic boundary information as an auxiliary task. The SBCB framework incorporates a complementary semantic boundary detection (SBD) task with a multi-task learning approach. It enhances the segmentation backbone without introducing additional parameters during inference or relying on independent post-processing modules. The SBD head utilizes multi-scale features from the backbone, learning low-level features in early stages and understanding high-level semantics in later stages. This complements common semantic segmentation architectures, where features from later stages are used for classification. Extensive evaluations using popular segmentation heads and backbones demonstrate the effectiveness of the SBCB. It leads to an average improvement of 1.2% in IoU and a 2.6% gain in the boundary F-score on the Cityscapes dataset. The SBCB framework also improves over- and under-segmentation characteristics. Furthermore, the SBCB adapts well to customized backbones and emerging vision transformer models, consistently achieving superior performance. In summary, the SBCB framework significantly boosts segmentation performance, especially around boundaries, without introducing complexity to the models. Leveraging the SBD task as an auxiliary objective, our approach demonstrates consistent improvements on various benchmarks, confirming its potential for advancing the field of semantic segmentation.


Introduction
Semantic segmentation is an actively studied field in computer vision and is crucial for various challenging applications such as autonomous driving and virtual reality. Semantic segmentation is a pixel-wise classification task where each pixel represents a category. A standard metric for quantifying segmentation quality is intersection-over-union (IoU), defined as the ratio of the intersection of the predicted segmentation mask and the ground-truth (GT) segmentation mask to the union of the two masks. With most methods competing for the best IoU score, the boundary quality of the segmentation masks is often overlooked Cheng et al. [2021]. However, more precise object segmentation masks can significantly benefit various downstream applications, such as object proposal generation Bertasius et al. [2015], depth estimation Ramamonjisoa et al. [2020], and image localization Ramalingam et al. [2010].
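The IoU definition above can be computed directly from boolean masks; a minimal sketch (our own illustration, not the evaluation code used in the paper):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks: |pred AND gt| / |pred OR gt|."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: if both masks are empty, the masks agree perfectly.
    return float(inter) / float(union) if union > 0 else 1.0
```

The mIoU reported later in the paper is this quantity averaged over categories.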
Closely related to semantic segmentation, semantic boundary detection (SBD) is also an active computer vision research topic. SBD is a multi-label classification formulation of the classical binary edge detection task, which requires the model to both detect edges and classify their categories. Since boundaries always surround the segmentation map, SBD is often considered a dual problem of semantic segmentation.
Joint modeling of segmentation and boundary detection has recently become popular to combat the issues of poor boundary quality in semantic segmentation Takikawa et al. [2019], Li et al. [2020], Zhen et al. [2020], Yu et al. [2021].

Figure 1: A simple overview of the Semantic Boundary Conditioned Backbone (SBCB) framework. The semantic boundary detection (SBD) head is applied to the backbone of the semantic segmentation head during training. The on-the-fly (OTF) semantic boundary generation module generates ground-truth (GT) semantic boundaries to train the SBD head. This simple framework improves the segmentation quality because the task of SBD is complementary but more challenging than the main task, which forces the backbone network to explicitly and jointly model boundaries and their relation to semantics.
Not only do these approaches improve segmentation accuracy around the boundaries, but they also prove that explicit modeling of the boundaries improves the overall IoU as well. The most common approach for joint modeling is to propose a novel method of using the features learned in the boundary heads to improve the segmentation quality. Although effective, these methods require specific architectures and are not easily transferable to other segmentation models. A notable exception is SegFix Yuan et al. [2020], an effective post-processing method that improves the segmentation quality by fixing segmentation errors around the boundaries. However, SegFix requires the user to train a separate post-processing model and adds another step during inference. We argue that we can intrinsically improve the segmentation quality by conditioning the backbone of the segmentation head on semantic boundaries, a technique that is model-agnostic and can be applied to any hierarchical backbone.
To this end, we present the Semantic Boundary Conditioned Backbone (SBCB) framework, a training framework aimed at boosting the segmentation quality of various segmentation architectures. In this framework, we add a lightweight SBD head on the backbone of the segmentation network during training and perform multi-task training. The SBD head is specifically designed so that the earlier stages of the backbone are conditioned on low-level features, and the later stages on higher-level semantic understanding. We can discard the SBD head during inference, retaining the benefits of the conditioned backbone without any computational costs or an increase in network parameters. The models trained using our framework consistently show significant improvements in their metrics, especially around the mask boundaries. We show the framework's effectiveness by applying it to various segmentation models with varying segmentation heads and backbones. The contributions are as follows:
• We propose a model-agnostic training framework aimed at conditioning the backbone for semantic segmentation called the Semantic Boundary Conditioned Backbone (SBCB) framework. This is the first training framework that utilizes semantic boundaries as an auxiliary task to improve various segmentation models both in terms of IoU and boundary F-score. Our framework only uses the SBD head during training and does not add any computational costs during inference. We provide extensive experiments to prove the effectiveness of the framework.
• We propose the Binary Boundary Conditioned Backbone (BBCB) framework to compare with the SBCB framework and show that SBD is the better-suited auxiliary task. The use of binary boundaries and edges has been vaguely proposed by previous works as an auxiliary task for specific architectures, yet it has not been made into a generalized framework compatible with various architectures.
• We propose applying our framework to customized architectures such as BiSeNet, STDC, and the recent vision transformers.
• We propose methods of utilizing the SBD head used in the SBCB framework for explicit feature fusion and show how the SBCB framework further contributes to the research in multi-task models of semantic segmentation and SBD.
• The SBCB framework is open-sourced to benefit the community.
Related Work

Semantic Segmentation. In computer vision, semantic segmentation is one of the most popular and challenging tasks and boasts a rich set of prior works. Long et al. Long et al. [2015] proposed an end-to-end trainable fully convolutional network adapted from image classification models. In Chen et al. [2017], the authors introduced dilated convolution and atrous spatial pyramid pooling (ASPP) to capture multi-scale contextual information. More recently, the use of vision transformers Liu et al. [2021] for semantic segmentation has become popular due to their capability of learning long-range contexts Strudel et al. [2021], Xie et al. [2021]. In this paper, we do not explicitly explore new methods of contextual modeling for semantic segmentation. Instead, we introduce a framework that can be easily integrated with these models and demonstrate how our framework can improve upon these baselines.
Meanwhile, there have been works that directly model boundary information for segmentation using novel loss functions Chen et al. [2020], Wang et al. [2022]. Our work focuses on multi-task learning of semantic segmentation and boundaries, which can also incorporate these loss functions.
Edge and Semantic Boundary Detection. Similar to semantic segmentation, edge and boundary detection have been widely studied. Xie et al. Xie and Tu [2015] introduced a CNN model that can be trained end-to-end, which paved the way for various edge detection models like Liu et al. [2017], Pu et al. [2022]. Yu et al. Yu et al. [2017] extended the task of binary edge detection to semantic boundary detection (SBD) by formulating the problem as multi-label pixel-wise classification. Hu et al. Hu et al. [2019] introduced a dynamic fusion model with adaptive weights for better contextual modeling. DDS Liu et al. [2022a] proposed a deep supervision framework that supervises all side outputs and is currently the state-of-the-art method for SBD.
Multi-Task Learning. In this paper, we define multi-task learning (MTL) as explicit joint modeling of two or more tasks, as in the methods introduced in Misra et al. [2016], Kokkinos [2017], Xiao et al. [2018], Xu et al. [2018].
While most models in computer vision are task-specific, there is great interest in joint modeling. Solving multiple problems with a single model could create efficient systems and improve recognition for general AI, such as embodied agents Xia et al. [2018], Narasimhan et al. [2020]. In MTL, it is common to use a multi-head architecture with a shared backbone for memory efficiency. The backbone is meant to learn a shared representation between the tasks, but this often fails because the backbone is designed for a single task, leading to worse results Kokkinos [2017], Misra et al. [2016]. The work of Liu et al. [2018] explores a novel mechanism for obtaining features by adding task-specific attention modules to the backbone. Our work, however, explores a two-head architecture, where the auxiliary semantic boundary detection task is complementary to the main segmentation task.
In semantic segmentation, edges and boundaries have been used as auxiliary tasks. Takikawa et al. Takikawa et al. [2019] introduced an MTL framework using binary boundary detection as an auxiliary task to improve semantic segmentation, especially for pixels near mask boundaries. Similarly, Li et al. Li et al. [2020] introduced a novel framework for explicitly modeling the body and edge features. This paper explores the joint modeling of semantic segmentation and semantic boundaries as an MTL framework for conditioning the backbone features.
Zhen et al. Zhen et al. [2020] introduced the first joint semantic segmentation and boundary detection (JSB) model and proposed the iterative pyramid context module and a duality loss that enforces consistency between the two tasks. Yu et al. Yu et al. [2021] proposed a dynamic graph propagation approach to couple the two tasks and refine segmentation and boundary maps. In this paper, we introduce a simple yet effective modular multi-headed model that does not require complex modeling to explicitly fuse the two tasks. We show that a shared backbone is enough to improve both tasks significantly. We also show that we can develop a JSB model using the semantic boundary head of our framework, which can further boost semantic segmentation performance.
SegFix. SegFix Yuan et al. [2020] is a model-agnostic post-processing network that refines the output of a segmentation model with an independent network. The key idea of this method is to replace unreliable predictions near the mask boundaries with reliable interior labels. SegFix is similar to our approach in that both aim to improve segmentation quality using boundaries in a model-agnostic way. The key difference is that our method is a training framework, whereas SegFix requires training another model and two-step inference. In fact, SegFix can be combined with our framework to boost performance, which we will show in this paper.

Figure 2: Overview of the CASENet architecture. The architecture utilizes sides 1, 2, 3, and 5 of the backbone, where the Side Layer is applied. The Side Layers consist of a single convolutional layer followed by a deconvolutional layer which upsamples the feature resolution to the size of the input image. While this Side Layer works, the output prediction produces heavy artifacts. To mitigate the artifacts, we use a 1 × 1 convolutional kernel followed by bilinear upsampling and a 3 × 3 convolutional kernel. Finally, the features are concatenated into a single tensor using sliced concatenation and passed to a 1 × 1 grouped convolution with the number of sides (four) as the number of groups. We use the output of the last Side Layer as an auxiliary output supervised with semantic boundaries.

Approach
The overview of the Semantic Boundary Conditioned Backbone (SBCB) framework is shown in Figure 1. During training, we add a semantic boundary detection (SBD) head to the backbone, which receives multi-scale features from selected stages of the backbone. The SBD head is supervised using ground-truth (GT) semantic boundaries that are generated on-the-fly from the GT segmentation masks. During inference, if the targeted task does not require SBD, the SBD head can be discarded, resulting in a semantic segmentation model with no increase in parameters.
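The training/inference asymmetry above can be sketched as a thin wrapper module. This is our own illustrative pseudocode, not the paper's implementation; the module names (`SBCBModel`, `seg_head`, `sbd_head`) are hypothetical:

```python
import torch
import torch.nn as nn

class SBCBModel(nn.Module):
    """Sketch of the SBCB setup: a shared backbone feeding a segmentation
    head and an auxiliary SBD head that consumes multi-scale features."""

    def __init__(self, backbone, seg_head, sbd_head):
        super().__init__()
        self.backbone = backbone      # returns a list of per-stage features
        self.seg_head = seg_head
        self.sbd_head = sbd_head      # used only during training

    def forward(self, x, train=True):
        feats = self.backbone(x)
        seg = self.seg_head(feats)
        if train:
            # The SBD output is supervised with OTF-generated GT boundaries.
            sbd = self.sbd_head(feats)
            return seg, sbd
        # At inference the SBD head is simply skipped: no extra parameters
        # or compute are required to produce the segmentation.
        return seg
```

Because the SBD branch is only executed when `train=True`, exporting or deploying the model after training yields exactly the baseline segmentation network.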
In Section 3.1, we review existing SBD architectures and introduce the SBD heads used in our experiments. In Section 3.2, we detail the framework by applying the SBCB framework to DeepLabV3+ and HRNet. In Section 3.3, we explain the OTF semantic boundary generation module, which is the key to making this framework flexible and easy to use. Finally, in Section 3.4, we explain the loss function used for the framework.

Semantic Boundary Detection Heads
In this section, we review some major SBD models based on ConvNets that have come out over the years. This section will help readers understand the SBD head used in the SBCB framework as well as the experiments. We also provide some helpful modifications that we found to work well during our reimplementation. Finally, we introduce the "Generalized" versions of these SBD heads that we use in the SBCB framework.
CASENet. The CASENet architecture was proposed by Yu et al. Yu et al. [2017], which suggested a novel nested architecture without deep supervision on ResNet. The architecture is depicted in Figure 2. The ResNet backbone is modified to capture features with larger resolution (explained in depth in Section 5.7). At each stage of the backbone except for stage 1, the features are passed into the Side Layer, which consists of a 1 × 1 convolutional kernel followed by a deconvolutional layer to increase the resolution to match the input image. Throughout the paper, we use "Stage" and "Side" interchangeably. Stages are based on the original papers of the backbone, oftentimes not including the Stem. We use "Side", a term used in SBD-related papers, which includes the Stem. The last Side Layer (Side 5) outputs an N_cat × H × W tensor while the other Side Layers (Sides 1 to 4) output 1 × H × W, where N_cat is the number of categories, and H and W are the height and width of the image. The outputs of the Side Layers are followed by a Fuse Layer, which consists of a sliced concatenation of the features with a 1 × 1 convolution kernel to output an N_cat × H × W logit, which is supervised by semantic boundaries. The output of the last Side Layer is also supervised by semantic boundaries and serves as an auxiliary signal. The semantic boundary supervision loss L_SBD for the Fuse Layer and the last Side Layer is explained in Section 3.4.
We noticed that the original implementation of the Side Layer produces boundaries with heavy checkerboard artifacts, so we replaced the deconvolution with bilinear upsampling followed by a 3 × 3 convolutional kernel, as shown in Figure 2. This technique was introduced for generative models that use deconvolution Odena et al. [2016], and we modified it so as not to increase the number of parameters.
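The artifact-free Side Layer described above can be sketched as follows. This is a minimal illustration of the 1 × 1 conv → bilinear upsample → 3 × 3 conv recipe; the class name and exact channel choices are our own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideLayer(nn.Module):
    """Sketch of the modified Side Layer: channel reduction with a 1x1 conv,
    bilinear upsampling to the input resolution, then a 3x3 conv to smooth
    out interpolation artifacts (replacing the original deconvolution)."""

    def __init__(self, in_ch: int, out_ch: int, up_factor: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.up_factor = up_factor

    def forward(self, x):
        x = self.reduce(x)
        x = F.interpolate(x, scale_factor=self.up_factor,
                          mode="bilinear", align_corners=False)
        return self.smooth(x)
```

Bilinear upsampling cannot introduce checkerboard patterns the way strided deconvolution can, and the trailing 3 × 3 convolution cleans up any remaining interpolation seams at a negligible parameter cost.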
DFF. The DFF architecture was proposed in Hu et al. [2019] to improve the CASENet architecture by introducing the Adaptive Weight Learner to refine the output of the Fuse layer with attentive weights. As shown in Figure 3, the Fuse layer outputs the sliced concatenated features, and instead of a 1 × 1 convolutional kernel, the weights obtained by the Adaptive Weight Learner are applied to the tensor and summed to produce the output tensor.

Generalized SBD heads. To facilitate the SBCB framework, we generalize the SBD heads so they can be applied to various backbones and segmentation architectures. We call this the Generalized SBD head, as shown in Figure 5. In our framework, we generalized the architecture to have flexible Side and Fuse layers so that any previously mentioned SBD head (CASENet, DFF, and DDS) can be applied. The Side Layer could be the Side Layers introduced in CASENet or the Side Blocks in DDS. The Fuse Layer could be the Fuse Layer introduced in CASENet or the Fuse Layer with Adaptive Weight Learner in DFF. The number of Sides is also flexible: semantic boundaries supervise the N-th side output, while binary boundaries supervise the earlier side outputs when DDS is used.
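The sliced concatenation and per-category grouped fusion used by these heads can be sketched as follows. This is our own reading of the operation (function names are hypothetical): for each category, the semantic channel is interleaved with every binary side map so that a grouped 1 × 1 convolution fuses evidence per category:

```python
import torch
import torch.nn as nn

def sliced_concat(binary_sides, semantic_side):
    """Sketch of sliced concatenation: for each category c, stack the c-th
    semantic channel with every 1-channel binary side map.
    binary_sides: list of N x 1 x H x W tensors; semantic_side: N x C x H x W."""
    n_cat = semantic_side.shape[1]
    slices = []
    for c in range(n_cat):
        slices.append(semantic_side[:, c:c + 1])
        slices.extend(binary_sides)
    # Result: N x (n_cat * (1 + num_sides)) x H x W
    return torch.cat(slices, dim=1)

def make_fuse_conv(n_cat: int, n_sides: int) -> nn.Conv2d:
    """Grouped 1x1 conv that fuses each category's group of
    (1 + n_sides) channels into a single boundary logit."""
    return nn.Conv2d(n_cat * (1 + n_sides), n_cat, kernel_size=1, groups=n_cat)
```

The grouping guarantees that category c's fused logit only sees its own semantic channel plus the shared low-level side maps, which is what makes the fusion "sliced" rather than a dense mix of all channels.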

Framework
In this section, we introduce how we apply the SBD heads reviewed in Section 3.1 in the SBCB framework. To make the framework more comprehensive, we provide case studies of applying the SBCB framework to popular architectures such as DeepLabV3+ and HRNet. The SBCB framework can be applied similarly to other architectures, which we explore in Section 7.
DeepLabV3+ + SBCB. To apply the SBCB framework to DeepLabV3+, we do not need to adjust the number of Side Layers since the backbone is ResNet, as shown in Figure 6. We take the features from each side and use them for the SBD head. The general method of applying the SBCB framework does not change for different SBD heads. For example, when applying the DDS head, we take the Side 4 features and change the Side Layers to Side Blocks.
HRNet + SBCB. The HRNet backbone is composed of four stages, as shown in Figure 7. Since the first stage already reduces the resolution to 1/4, we use the features from the stem for the first Side Layer. HRNet differs from ResNet in that the feature resolutions are consistent throughout the stages while branching out into smaller resolutions in each stage. Because of this, we resize and concatenate the features of each stage before feeding them through the Side Layer. We take all the features of each stage to encourage better conditioning of the backbone.
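The resize-and-concatenate step for an HRNet stage can be sketched as below (an illustrative helper of ours, not the paper's code; the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def merge_hrnet_stage(feats):
    """Sketch: resize every branch of an HRNet stage to the resolution of
    the largest branch (feats[0]) and concatenate along channels, so the
    Side Layer sees all of the stage's features."""
    h, w = feats[0].shape[2:]
    ups = [feats[0]] + [
        F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
        for f in feats[1:]
    ]
    return torch.cat(ups, dim=1)
```

Feeding the concatenation (rather than a single branch) into the Side Layer is what lets the boundary supervision condition every parallel branch of the stage.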
To apply the SBCB framework to different backbone architectures, we must consider the following:
• Does the first Side Layer receive features with the largest resolution?
• Are any features not being utilized at each Side or Stage?
• Which Side or Stage is best suited for semantic boundary supervision?
When applying SBCB to hierarchical backbones like ResNet, the earlier stages should be attached to binary side layers, while the last stage is naturally suited for the semantic side layer. Fortunately, most semantic segmentation architectures use some form of hierarchical backbone, which makes applying the SBCB framework simple. When we have backbones such as HRNet, where features are hierarchical and branching out, we must make sure to incorporate all of the features, i.e., concatenate them. For heavily customized backbones, like the ones we explore in Section 7, we can still apply the SBCB framework by considering the three key items above. Some backbones developed for classification tasks may downsample the feature resolution aggressively. In that case, it may be beneficial to increase the feature resolution by changing the strides and dilations of the convolutional kernels so that the first side feature has a resolution of at least 1/2 of the input image. For this, we can apply the "backbone trick," which we discuss in Section 5.7.

On-the-fly Ground Truth Generation
For the tasks of SBD and edge detection, humans manually annotate the edges. Thus, the annotated image's scales and the width of the edges are predetermined. Some datasets for SBD, such as the Cityscapes and SBD datasets, provide preprocessed GT boundaries derived from semantic and instance masks to provide more training data. Nevertheless, the number of scales is limited since it is infeasible to generate all scales before training. On the other hand, in the semantic segmentation task, it is common practice to resize and rescale the GT mask during training to remedy overfitting by increasing the variations of the dataset. This is impossible for preprocessed semantic boundaries since resizing results in inconsistent edge widths, as shown in Figure 9.

Figure 9: The example on the left uses preprocessed boundaries, and the one on the right uses OTFGT boundaries. OTFGT boundaries have consistent boundary widths, while preprocessed boundaries vary depending on the rescale value.
To remedy this, we developed a simple semantic boundary generation algorithm efficient enough to run in the preprocessing pipeline, called the on-the-fly (OTF) semantic boundary GT generation module (OTFGT). The OTFGT generates semantic boundaries from semantic segmentation masks and can create instance-sensitive boundaries when instance segmentation masks are available. The details of the OTFGT are explained in Appendix A.
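A minimal sketch of such on-the-fly generation is shown below. This is our own simplified stand-in for the OTFGT (whose actual algorithm is in Appendix A): it marks a pixel as a boundary of its class when any pixel within a fixed radius carries a different label, so the boundary width stays constant regardless of how the mask was rescaled:

```python
import numpy as np

def semantic_boundaries(mask: np.ndarray, num_classes: int, radius: int = 1):
    """Generate per-category boundary maps from a H x W label mask.
    A pixel of class c is a boundary pixel if any pixel within `radius`
    (Chebyshev distance) has a different label. Returns num_classes x H x W."""
    h, w = mask.shape
    pad = np.pad(mask, radius, mode="edge")
    diff = np.zeros((h, w), dtype=bool)
    # Compare the mask against all shifted copies of itself.
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy:radius + dy + h,
                          radius + dx:radius + dx + w]
            diff |= shifted != mask
    bnd = np.zeros((num_classes, h, w), dtype=np.uint8)
    for c in range(num_classes):
        bnd[c] = (mask == c) & diff
    return bnd
```

Because the boundaries are derived from the (already augmented) mask inside the data pipeline, every random rescale yields boundaries of the same pixel width.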

Loss Functions
Given an input image, the model generates segmentation and boundary maps with pre-defined semantic categories. We apply cross-entropy (CE) loss, L_Seg, for each pixel of the segmentation map. As for the SBD head, we apply binary cross-entropy (BCE) loss to each supervised boundary output. The total loss is

L_total = L_Seg + α Σ_{B ∈ S_SBD} L_SBD(B) + β Σ_{B ∈ S_Bin} L_Bin(B),   (1)

where α and β are constants for balancing the effects of the losses from each task. S_SBD is a set of semantic boundary predictions and S_Bin is a set of binary boundary predictions. For CASENet and DFF, S_SBD = {B_sideN, B_fuse}, where B_sideN represents the last side output and B_fuse represents the final fused prediction, as shown in Figure 5. For DDS, we supervise S_SBD = {B_sideN, B_fuse} and S_Bin = {B_side1, B_side2, . . . , B_sidek}.
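A sketch of this multi-task objective is below. The weighting placement (α on the semantic boundary terms, β on the binary terms) is our reading of the balancing constants described above, and the function signature is our own:

```python
import torch
import torch.nn.functional as F

def sbcb_loss(seg_logit, seg_gt, sbd_logits, sbd_gt,
              alpha=5.0, beta=1.0, bin_logits=(), bin_gt=None):
    """Sketch of the SBCB training loss: per-pixel cross-entropy for
    segmentation plus weighted multi-label BCE for every supervised
    boundary output (semantic sides/fuse, and binary sides for DDS)."""
    loss = F.cross_entropy(seg_logit, seg_gt)
    for logit in sbd_logits:   # e.g. {B_sideN, B_fuse}
        loss = loss + alpha * F.binary_cross_entropy_with_logits(logit, sbd_gt)
    for logit in bin_logits:   # earlier side outputs when DDS is used
        loss = loss + beta * F.binary_cross_entropy_with_logits(logit, bin_gt)
    return loss
```

In practice, boundary pixels are heavily outnumbered by non-boundary pixels, so implementations often add per-pixel class-balancing weights inside the BCE terms; we omit that here for brevity.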

Experiment Setup
In this section, we go over the details of our experiments, including the dataset, hyperparameters, and implementations.

Datasets
In our experiments, we use three datasets, namely the Cityscapes, BDD100K, and Synthia datasets. We visualize and explain each dataset in Figure 10.
Cityscapes. We evaluate our models on the popular Cityscapes dataset Cordts et al. [2016], which contains 2975 training images, 500 validation images, and 1525 testing images with 19 semantic categories. Following Yu et al. [2017, 2018a], Hu et al. [2019], Liu et al. [2022a], the dataset has also been widely adopted as the standard benchmark for SBD. We conduct quantitative studies for both semantic segmentation and SBD on the validation set and benchmark our method on the test set for semantic segmentation.
BDD100K. The BDD100K dataset Yu et al. [2018b] is a driving dataset aimed at multi-task learning for autonomous driving. It is the largest driving video dataset, with 100K video frames and ten tasks, and it contains 10K images with a resolution of 1280 × 720 for the semantic segmentation task. The dataset is split into 7K training, 1K validation, and 2K test splits; we only use the training and validation splits for our ablation experiments. The annotated labels are the same as in the Cityscapes dataset.
Synthia. The Synthia dataset Ros et al. [2016] is a CG dataset generated using a simulator, aimed at providing auxiliary data for Cityscapes as well as for experimenting with domain adaptation. We use the "Rand" set of the dataset, which contains 13.4K images with a resolution of 1280 × 760 and the same annotated categories as the Cityscapes dataset. We use Synthia as a stand-alone dataset to explore the effect of the SBCB framework under annotations with precise boundaries. We split the dataset into 10.4K training, 1.5K validation, and 1.5K test splits.

Evaluation Metrics
Segmentation Metrics. We use the mean intersection-over-union (mIoU) to evaluate segmentation performance. Following Takikawa et al. [2019], we adopt the boundary F-score to evaluate segmentation performance around the boundaries of the masks. We use a pixel width of 3px for the boundary F-score unless explicitly stated.
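The boundary F-score with a pixel-width tolerance can be sketched as follows. This is a simplified stand-in of ours (not the evaluation code of Takikawa et al. [2019]): a predicted boundary pixel counts as correct if a GT boundary pixel lies within the tolerance, and symmetrically for recall:

```python
import numpy as np

def boundary_f_score(pred_bnd: np.ndarray, gt_bnd: np.ndarray, tol: int = 3):
    """F-score between two boolean boundary maps with a Chebyshev-distance
    pixel tolerance `tol` (cf. the 3px width used in the paper)."""
    def dilate(b, r):
        h, w = b.shape
        pad = np.pad(b, r)
        out = np.zeros_like(b)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out |= pad[r + dy:r + dy + h, r + dx:r + dx + w]
        return out

    gt_d, pred_d = dilate(gt_bnd, tol), dilate(pred_bnd, tol)
    tp_p = np.logical_and(pred_bnd, gt_d).sum()   # matched predictions
    tp_r = np.logical_and(gt_bnd, pred_d).sum()   # matched GT pixels
    prec = tp_p / max(pred_bnd.sum(), 1)
    rec = tp_r / max(gt_bnd.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)
```

With `tol = 3`, a prediction displaced by up to three pixels from the GT boundary is still counted as correct, which is exactly why this metric is more informative near boundaries than plain IoU.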
Boundary Detection Metrics. We follow Yu et al. [2018a] and adopt the maximum F-score (mF) at the optimal dataset scale (ODS), evaluated with the instance-sensitive "thin" protocol, for SBD.

Implementation Details
Data Loading. Unless explicitly stated, we unify the training crop size, training iterations, and batch size for both tasks to 512 × 1024, 40k, and 8, respectively, for the Cityscapes dataset. We use the same parameters for the Synthia and BDD100K datasets but with a crop size of 640 × 640. We fine-tuned the models evaluated on the Cityscapes test benchmark for an additional 40k iterations using the training and validation splits, following Yu et al. [2021]. We perform common data augmentations, notably random scaling (scale factors in [0.5, 2.0]), horizontal flips, and photo-metric distortions.
Optimization. We employ the SGD optimizer with a momentum coefficient of 0.9 and a weight decay coefficient of 5 × 10^−4 during training. We optimize the network using the "poly" learning rate policy, where the initial learning rate (0.01) is multiplied by (1 − iter/max_iter)^power with power = 0.9.
Loss. We set α = 5 and β = 1 for our loss function in Eq. 1.
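The poly schedule above is a one-line formula; a minimal sketch:

```python
def poly_lr(base_lr: float, iteration: int, max_iter: int,
            power: float = 0.9) -> float:
    """Poly learning-rate policy: base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power
```

For example, with `base_lr = 0.01` and `max_iter = 40000`, the learning rate starts at 0.01 and decays smoothly toward zero as training approaches the final iteration.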
Inference. In our experiments, we conduct evaluations with single-scale whole inference for the Cityscapes dataset and slide inference for the Synthia and BDD100K datasets. For evaluating semantic segmentation performance in Section 6.3, we apply the multi-scale and flip (MS+Flip) inference strategy with scales of [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0].

Software and Hardware
To conduct all of our experiments, we use PyTorch and modify the popular semantic segmentation framework "mmsegmentation" Contributors [2020] for our task. We report all experimental results using the same software and hardware and train all models under the same conditions. The models are trained on two NVIDIA A6000 GPUs and evaluated on a single NVIDIA RTX8000.

Ablation Studies
In this section, we perform ablation studies on the SBCB framework from various aspects. In Section 5.1, we compare the SBD heads and choose a candidate for the experiments throughout the paper. In Section 5.2, we determine the optimal side configuration. In Section 5.3, we look at which categories benefit the most from the SBCB framework. In Section 5.4, we compare the SBCB framework with other auxiliary tasks. In Sections 5.5 and 5.6, we compare the SBCB framework with state-of-the-art multi-task and post-processing methods and show that our framework can complement them to further improve segmentation quality. In Section 5.7, we investigate the effects of modifying the backbone configuration in a simple yet effective way to improve segmentation and SBD. In Section 5.8, we show the effects of the SBCB framework on the task of SBD. Finally, in Section 5.9, we show that our framework improves segmentation around the boundaries.

Which SBD head to use?
In this section, we explore the effects of using different semantic boundary detection (SBD) heads for the SBCB framework and find the best candidate for further evaluation.
Table 1a shows the DeepLabV3+ model trained using three different SBD heads, CASENet, DFF, and DDS, compared with single-task baseline models. All SBD heads in the SBCB framework improve the single-task DeepLabV3+ model. We can also see that joint training helps improve the SBD metric (maximum F-score). We also include the number of parameters and the computational costs in GFLOPs to show how much cost the SBD heads introduce during training. While DDS adds high training costs, it is also the most performant of the three heads. On the other hand, CASENet adds only a small number of parameters to the original model. The trade-off of using DDS over CASENet in the SBCB framework might not be beneficial in terms of performance gains, which becomes more evident as we evaluate DDS on other datasets and backbones.
In Figure 11, we show qualitative results of the CASENet head applied to DeepLabV3+ compared with the baselines. The additional semantic boundary supervision allows the model to detect small, thin objects better. The SBCB framework also enables better boundary detection with fewer artifacts and better perception of objects. Panels (f) and (g) show the output of DeepLabV3+ trained with the SBCB framework using the CASENet head: small and thin objects are recognized better, and boundaries are smoother with fewer artifacts.
Different segmentation heads. The results for other segmentation heads are shown in Table 1b, where the general trend is the same as in Table 1a.
Different backbone. We also explore the effects of using another popular backbone, HRNet-48 (HR48), with results shown in Table 1c. This time, the CASENet head outperforms DDS and DFF by significant margins (1.0% and 0.5%, respectively). The CASENet head also achieves an mF of 78.9%, identical to the heavy and inefficient single-task DDS model.
Different datasets. In computer vision, a model's performance differs depending on the dataset. We additionally evaluate the SBD heads on the BDD100K and Synthia datasets, as shown in Tables 1d and 1e, respectively. On BDD100K, the DDS head significantly outperforms the baseline model and the CASENet head. The DFF head performs better than the CASENet head for the first time on this dataset. As for Synthia, the CASENet head performs better than DDS.
CASENet as the candidate. While the DDS head performs better than CASENet for the most part, considering the additional parameters and computational costs, it is more beneficial to use the CASENet head. Besides, the SBD head in the SBCB framework is only used as an auxiliary signal, and the CASENet head outperforms DDS in some results. Note that when squeezing out higher metrics is critical and the computational costs can be ignored, using the DDS head may result in better metrics. For the rest of the paper, we use the CASENet head as our main SBD head for the SBCB framework.
In Figure 12, we show qualitative visualizations that compare DeepLabV3+ with and without the CASENet head. The feature maps obtained from the last stage of the backbone show that the backbone conditioned on SBD exhibits boundary-aware characteristics, which reduces segmentation errors, especially around the boundaries.

Which sides to supervise?
The CASENet head applied to the ResNet backbone has five sides: Sides 1, 2, 3, 4, and 5. In Table 3, we show the effect of using different side configurations. For consistency with performant single-task SBD models, we always include Sides 1 and 5, because Side 1 is required for low-level understanding and has the largest feature resolution, while Side 5 is required for high-level understanding. We add Sides 2, 3, and 4 and compare the performance gains. Note that Sides 1+2+3+5 is the original configuration. The table shows that the original configuration works best on two models (PSPNet and DeepLabV3). On DeepLabV3+, configuration 1+2+3+4+5 outperforms the original configuration by 0.2%. We believe the differences in performance gains are negligible, but users of the SBCB framework should know that each model could have an optimal side configuration. Therefore, for fairness, we choose the original configuration for evaluating other models and benchmarking our methods.

Does it improve all categories?
Table 2 provides the per-category IoU comparisons for each model. Although most categories improve with the SBCB framework, some categories result in worse IoU. The most frequently degraded categories are "truck", "bus", and "train", which have relatively few samples and are easily confused with "car". Additional measures during training, such as Online Hard Example Mining (OHEM), could mitigate this effect.

Comparison with other auxiliary methods
FCN auxiliary head. In PSPNet Zhao et al. [2017], the authors added another classifier to the backbone to stabilize training and improve segmentation metrics. In detail, they attached an FCN head to the fourth stage of the backbone (one before the last stage). The auxiliary FCN head is trained on the same segmentation task as the main head. This technique is still used today, abundantly so in open-source projects such as mmseg.
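As a concrete illustration, such an auxiliary classifier can be sketched in PyTorch roughly as follows; the class name and channel sizes are illustrative, not taken from any particular implementation:

```python
import torch
import torch.nn as nn

class AuxFCNHead(nn.Module):
    """Hypothetical sketch of the PSPNet-style auxiliary classifier: a 3x3
    conv block followed by a 1x1 classifier, attached to the fourth backbone
    stage, trained on the same labels as the main head, and discarded at
    inference."""

    def __init__(self, in_channels, num_classes, mid_channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Dropout2d(0.1),
        )
        self.classifier = nn.Conv2d(mid_channels, num_classes, 1)

    def forward(self, stage4_feat):
        return self.classifier(self.conv(stage4_feat))

# Stage-4 features of a ResNet-101 (1024 channels) at 1/8 input resolution.
head = AuxFCNHead(in_channels=1024, num_classes=19)
logits = head(torch.randn(2, 1024, 64, 128))
print(logits.shape)  # torch.Size([2, 19, 64, 128])
```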

Binary boundary auxiliary head. Although not used as often, various papers have applied binary edge and boundary detection as an auxiliary task for semantic segmentation. Even though binary boundary detection is a different task from semantic segmentation, the authors found that the features learned in the edge detection head can be fused into the segmentation head.
In this section, we compare the SBCB framework with the auxiliary techniques mentioned above, which we call "FCN" and "Binary Boundary Conditioned Backbone (BBCB)". Note that BBCB is the SBCB framework applied to binary boundary detection instead. We apply FCN, BBCB, and SBCB to three popular segmentation heads (PSPNet, DeepLabV3, and DeepLabV3+) and use ResNet-101 as the backbone. The results on the Cityscapes validation split are shown in Table 4a. While all auxiliary signals improve IoU, the models trained using the SBCB framework are consistently the best. The improvements from SBCB are roughly double those from BBCB, showing that the semantic nature of the SBD task is crucial. FCN yields its largest gain of 0.7% on PSPNet but has minimal impact on the other models. The BBCB and SBCB frameworks can also complement FCN, and the results show that the combination achieves higher IoU. Another important aspect is the number of additional parameters these auxiliary signals introduce during training. While SBCB and BBCB add only thousands of parameters, FCN adds 2.37M parameters. Considering the performance gains and the additional parameters, boundary-based auxiliary signals clearly provide more benefits than FCN.
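The parameter comparison can be made concrete with a few lines of PyTorch; both head definitions below are hypothetical stand-ins (an mmseg-style FCN auxiliary head and CASENet-style 1 × 1 side layers), not the exact implementations:

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Hypothetical FCN auxiliary head on stage-4 (1024-channel) features:
# one 3x3 conv to 256 channels plus a 1x1 classifier for 19 classes.
fcn_aux = nn.Sequential(
    nn.Conv2d(1024, 256, 3, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 19, 1),
)

# CASENet-style side layers for a ResNet: 1x1 convs producing 1-channel
# binary sides plus a 19-channel semantic side (channel sizes illustrative).
sbcb_aux = nn.ModuleList(
    [nn.Conv2d(c, 1, 1) for c in (64, 256, 512)] + [nn.Conv2d(2048, 19, 1)]
)

print(count_params(fcn_aux))   # roughly 2.4M parameters
print(count_params(sbcb_aux))  # tens of thousands of parameters
```

The 3 × 3 convolution dominates the FCN head's count, which is why the 1 × 1 side layers of the boundary heads are orders of magnitude cheaper.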
We also evaluate the same models and auxiliary heads on the Synthia dataset, as shown in Table 4b. Surprisingly, FCN and BBCB add little performance gain and sometimes score worse than the baselines. However, SBCB improves upon the baseline by over 1%. It is plausible that the features learned using FCN conflicted with the main heads. Compared with Cityscapes, Synthia contains precise segmentation masks rendered by a CG engine instead of human annotation. In Synthia, classes such as "human" and "bike" have small and thin segmentation masks, which makes this dataset difficult. Although features learned by FCN complemented the features of the main head on Cityscapes, here the FCN appears to have derived a conflicting segmentation map, possibly because the FCN head has more layers (parameters) than SBCB or BBCB. One might expect BBCB to perform well because of its shallow architecture (far fewer parameters than FCN), but the results show the contrary. This is because BBCB focuses on low-level features without explicitly modeling high-level semantics. We believe this mismatch with the main task meant the main head did not receive good features for semantic segmentation on Synthia.
The SBCB framework conditions the backbone with SBD, a challenging task that focuses on low-level features while also requiring high-level features. The SBCB framework improves the segmentation metrics more than using FCN or binary boundaries as auxiliary signals because of the hierarchical modeling of the SBD task.

Comparisons with SegFix
In Table 5, we compare our framework with SegFix Yuan et al. [2020], a popular post-processing method. We obtained the results for SegFix using the open-source code, which refines the output prediction based on offsets learned with HRNet2x. Compared side by side with models trained using the SBCB framework, SegFix performs around 0.1% to 0.4% better. However, SBCB combined with FCN (as mentioned in Section 5.4) yields competitive performance, significantly outperforming SegFix on two models.
Considering that SegFix is an independent post-processing model, our framework produces competitive results without any post-processing or additional parameters during inference. SegFix, in contrast, adds a post-processing module that requires separate training. Moreover, motivated by the difficulty of predicting labels around mask boundaries, SegFix aims to correct the predictions around the boundaries; the base model therefore does not actively learn boundary-aware features. Our training framework, on the other hand, conditions the backbone to be boundary-aware by solving SBD, as we show in Section 5.9. In other words, SegFix and our framework are complementary because boundary-aware predictions are easier for SegFix to correct. This is evident from the major improvements when SBCB is combined with SegFix, as shown in the table.
Table 8: Ablation studies of the "Backbone Trick". We modify the stride and dilation of each stage of the ResNet-101 backbone so that the number of parameters stays the same while the backbone generates larger feature maps. This technique was introduced in Xie and Tu [2015]; we prepend "HED" to the names of backbones that use it.
(a) Results on the Cityscapes validation split.

Comparisons with GSCNN
GSCNN Takikawa et al. [2019] is a popular multi-task architecture that couples semantic segmentation with binary boundary detection through a dedicated shape stream, which branches out from the side layers similarly to the SBD heads in the SBCB framework. The key difference is that the features from the shape stream are explicitly merged into the semantic segmentation head. With a ResNet-101 backbone, GSCNN is a customized DeepLabV3+ that uses an ASPP module.
An apples-to-apples comparison is difficult since the loss functions differ and we do not explicitly merge the features obtained in the SBD head into the segmentation head. However, in Table 6, we compare how well the SBCB framework improves DeepLabV3+ against several configurations of GSCNN. The baseline GSCNN is GSCNN without the image gradient (Canny edge); we also include the original configuration with the Canny edge, denoted by "+Canny". In addition, we experiment with supervising the shape stream using the SBD task, denoted by "SBD", modifying the shape stream by increasing its channels. Finally, we use the SBCB framework on GSCNN, denoted by "+SBCB", which adds the SBD head on the backbone without any other modifications.
GSCNN improves over DeepLabV3+ by +1.0%. Although the gain is lower than with binary boundary supervision, SBD supervision of the shape stream still improves over DeepLabV3+ by +0.5%, showing that boundary signals can significantly improve semantic segmentation. The SBCB framework improves DeepLabV3+ by 0.7% and 1.1% with the CASENet and DDS heads, respectively, matching the improvements of the original GSCNN configuration. Since the SBCB framework is flexible, it can also be applied to GSCNN itself, giving an even higher improvement of +1.4%.

Backbone Trick
In this section, we investigate the "backbone trick". In edge detection and SBD, a modified backbone is often used to increase the output resolution of the stages, without changing the number of parameters, by modifying the strides and dilations of each stage. The increase in resolution is necessary because edges are often small, and the feature maps need to be large enough to capture them. Backbones such as ResNet were designed for image classification and produce small feature maps unsuitable for edge detection. Keeping the number of parameters unchanged is also necessary because we want to reuse the pre-trained weights. In semantic segmentation, a similar trick changes the strides and dilations of the last two stages to retain a final feature resolution of 1/8 of the input image size. We show the common modifications for the ResNet backbone in Table 7.
In Tables 8a, 8b, and 8c, we show the results of using the HED version of ResNet-101 (HED ResNet-101) on Cityscapes, BDD100K, and Synthia, respectively. Compared with the normal segmentation ResNet-101 in Table 1, the results are generally better both for single-task models and for models trained with the SBCB framework. Higher performance gains are seen on the Synthia dataset, where higher-resolution feature maps may benefit the detailed and precise ground truths.
Although the "backbone trick" is common for ResNet-101, it can be applied to other backbones, such as transformer backbones, as seen in Section 7.4. Since the backbones are conditioned with SBD, the combination of SBD and the "backbone trick" can provide significant improvements without complex modeling.

Does SBCB also improve SBD metrics?
Based on the previous ablation studies, it is clear that the SBCB framework improves the metrics for semantic segmentation. We also evaluate the models trained using the SBCB framework on semantic boundary detection (SBD) performance, as shown in Table 9. We compare our DeepLabV3+ trained with the SBCB framework against state-of-the-art (SOTA) SBD models and against CSEL, a SOTA joint semantic segmentation and semantic boundary detection model. The table shows that models trained with the SBCB framework significantly outperform the SOTA single-task methods by 5% to over 10%. On joint modeling, our method outperforms CSEL without explicit modeling in the semantic boundary detection head. Although our aim was to condition the backbone for semantic segmentation, the SBCB framework also improves SBD performance, since the SBD head in turn benefits from the semantic segmentation supervision, which further proves the effectiveness of the framework.

Does SBCB improve segmentation around boundaries?
The SBCB framework improves segmentation quality around the mask boundaries. In Table 10, we show boundary F-scores for baseline models and models trained with the SBCB framework. The models trained using the SBCB framework consistently achieve higher boundary F-scores than their baselines.

Experiments on ADE20k
We perform additional experiments on ADE20k, another challenging dataset known for having 150 different classes Zhou et al. [2017]. We train DeepLabV3+ with ResNet-50 and ResNet-101 backbones and compare the results against models trained using the SBCB framework. As shown in Table 15, the SBCB framework improves the base models by around 0.5%.

BiSeNet
We apply the SBCB framework to the Bilateral Segmentation Network (BiSeNet) V1 and V2, models specialized for real-time semantic segmentation Yu et al. [2018c, 2020]. In both versions, the backbone is split into two paths. The Detail Path (or Spatial Path) is a shallow ConvNet composed of a few stages that retain large feature resolutions; the number of stages is four in BiSeNetV1 and three in BiSeNetV2. The Semantic Path (or Context Path), on the other hand, is a deeper ConvNet designed to capture high-level semantics. While BiSeNetV1's Semantic Path uses off-the-shelf architectures such as ResNet-50, BiSeNetV2 uses a customized six-stage ConvNet where the features from the middle stages are supervised using FCN auxiliary heads.
We apply the SBCB framework by choosing the stages (sides) of the backbone to be supervised by the SBD head. We take three stages from the Detail Path as the Binary Sides and use the last stage of the Semantic Path as the Semantic Side. Note that we do not modify the original model in any way; we only add the SBD head on top of the intermediate features of the backbone. See Appendix C for details.
The results of the SBCB framework on BiSeNet (V1 and V2) are shown in Table 16. As expected, the SBCB framework improves the models in both IoU and boundary F-score. This shows that the SBCB framework can be applied to less common architectures and still deliver performance gains.

STDC
Like BiSeNet, the STDC network is designed for efficient real-time semantic segmentation Fan et al. [2021]. However, the STDC network is a single-branch network that replaces the Detail Path with a Detail Head, which uses the features from the third stage to perform "detail guidance" during the training phase only. The Detail Head is supervised with a "Detail GT", generated on the fly using a multi-scale Laplacian convolution kernel in a manner similar to our method. The Detail GT contains spatial details such as boundaries and corners.
In this section, we replace the Detail Head with the SBD head and train using the SBCB framework. We take the first four stages of the backbone for the Binary Sides and use the output of the FFM as the Semantic Side for the SBD head (see Appendix D). The results are shown in Table 16, where we compare the original STDC with an STDC whose Detail Head is replaced by our SBD head. Using SBD as the auxiliary task yields substantial improvements in IoU. The Detail Head aimed at improving the segmentation quality around the boundaries, but our framework achieves higher improvements in the boundary F-score. Table 17 additionally reports the effect of the "Backbone Trick" (denoted by "Mod") on modern backbones; from that table, we can see that the SBCB framework can still be applied to improve these modern architectures and provides consistent performance gains in both IoU and boundary F-score.

Explicit Feature Fusion
We provide two feature fusion techniques that utilize the features learned in the SBD head and can be applied to further improve segmentation. The first technique uses simple channel concatenation with a few convolutional layers to encourage feature fusion; we call it the Channel-Merge method. The second is a naive merge in the style of GSCNN, where the features learned in the SBD head are also used in the ASPP head of DeepLabV3+; we call this the Two-Stream Merge method. The two fusion architectures are explained in more detail in Appendix E.
Table 18 shows the results of two baseline architectures with the SBCB framework and the feature fusion methods applied. The feature fusion methods can further improve segmentation performance. This comes with the downside of making the segmentation head dependent on the SBD head, which increases computational costs. We believe the SBCB framework helps boost existing segmentation models, and the SBD heads could further inspire exciting joint architectures like Channel-Merge and Two-Stream Merge.

Conclusion
We have proposed the SBCB framework, a simple yet effective training framework that boosts segmentation performance.
In the framework, a semantic boundary detection (SBD) head is applied to the hierarchical features of the backbone and supervised by semantic boundaries. We explored different SBD heads for the SBCB framework and showed that the CASENet architecture significantly improves segmentation quality without adding many parameters during training. Our experiments show that the SBCB framework improves segmentation quality for many popular backbones and segmentation heads. It also improves the segmentation quality around the boundaries, as evaluated by the boundary F-score. We have also experimented with customized backbones and recent transformer architectures to show that the SBCB framework is versatile. Not only is the SBCB framework effective, but we have also provided modifications and methods for explicit feature fusion to promote the broader use of semantic boundaries in semantic segmentation.
A On-the-fly Boundary Generation
In this section, we explain the on-the-fly (OTF) semantic boundary generation algorithm in detail. For a single label l, we apply a signed distance function (SDF) to the inner and outer masks, where the inner mask represents the pixels that belong to l and the outer mask represents the pixels that do not. We then take the sum of the inner and outer distance maps and mark the pixels under the radius as boundary pixels. When instance segmentation maps are available, we generate per-instance distance maps, which we threshold using the same radius. We sum all the boundaries for category l (including instance boundaries) and binarize the result. We repeat this step for each of the L labels and concatenate the L boundary maps to form an L × H × W semantic boundary tensor.
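The per-label procedure above can be sketched with SciPy's Euclidean distance transform; the function name is ours, and the per-instance branch is omitted for brevity:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def otf_semantic_boundaries(seg, num_classes, radius=2):
    """Sketch of the OTF generation described above: for each label l, sum
    the distance transforms of the inner (== l) and outer (!= l) masks and
    mark pixels whose summed distance falls within `radius`."""
    boundaries = np.zeros((num_classes,) + seg.shape, dtype=np.uint8)
    for l in range(num_classes):
        inner = seg == l
        if not inner.any():
            continue
        d_in = distance_transform_edt(inner)    # distance to the outer region
        d_out = distance_transform_edt(~inner)  # distance to the inner region
        boundaries[l] = ((d_in + d_out) <= radius).astype(np.uint8)
    return boundaries  # L x H x W binary semantic boundary tensor

seg = np.zeros((8, 8), dtype=np.int64)
seg[2:6, 2:6] = 1  # a 4x4 square of label 1 on a label-0 background
b = otf_semantic_boundaries(seg, num_classes=2, radius=1)
```

With `radius=1`, pixels immediately on either side of the mask edge are marked, giving a two-pixel-wide boundary; larger radii thicken it symmetrically.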

B Qualitative Visualizations
We show qualitative visualizations for the results in Tables 11a and 12a in Figures 17 and 18, respectively.

C BiSeNet + SBCB
In Figure 13, we show a detailed architecture diagram of which features of the BiSeNet backbone are used in the SBD head. In both BiSeNet V1 and V2, the architecture is composed of a Context Path and a Spatial Path. We use the three stages of the Spatial Path for the earlier Side Layers of the SBD head and the last feature of the Aggregation Layer for the last Side Layer.

D STDC + SBCB
In Figure 14, we show a detailed architecture diagram of how we apply the SBCB framework to the STDC architecture. The architecture resembles a ResNet-like hierarchical backbone, but the original STDC applies a Detail Head, which uses the features of the third stage. We remove the Detail Head and instead add an SBD head, using the first four stages for the Binary Side Layers and the final output of the FFM as the input to the Semantic Side Layer.

E Explicit Feature Fusion Architectures
In Figure 15, we show the proposed explicit feature fusion architecture built on top of the SBCB framework, called the Channel-Merge module. The diagram shows a backbone with hierarchical features and the PPM head used in PSPNet. The Channel-Merge module uses the features before upsampling in the Side Layers of the SBD head. Each feature is resized and concatenated into a single tensor, which is then concatenated with the features obtained by the PPM. The tensor passes through two 1 × 1 convolutional layers that mix the features in the channel direction (the number of convolutions can be modified). Finally, the features are split back into their original shapes and concatenated with the original side features to be upsampled and fused.
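A minimal PyTorch sketch of the Channel-Merge idea described above, with illustrative names and channel sizes (the actual module may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMerge(nn.Module):
    """Hypothetical sketch: resize SBD side features to the head's
    resolution, concatenate them with the head (e.g. PPM) features, mix
    everything with 1x1 convolutions in the channel direction, and split
    the result back into the original channel sizes."""

    def __init__(self, side_channels, head_channels, num_convs=2):
        super().__init__()
        total = sum(side_channels) + head_channels
        layers = []
        for _ in range(num_convs):
            layers += [nn.Conv2d(total, total, 1), nn.ReLU(inplace=True)]
        self.mix = nn.Sequential(*layers)
        self.splits = list(side_channels) + [head_channels]

    def forward(self, side_feats, head_feat):
        size = head_feat.shape[-2:]
        feats = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                 for f in side_feats] + [head_feat]
        mixed = self.mix(torch.cat(feats, dim=1))
        # The split side parts are then concatenated with the original side
        # features for upsampling and fusion, as in the diagram.
        return torch.split(mixed, self.splits, dim=1)

merge = ChannelMerge(side_channels=[1, 1, 19], head_channels=512)
sides = [torch.randn(2, 1, 64, 128), torch.randn(2, 1, 32, 64),
         torch.randn(2, 19, 16, 32)]
outs = merge(sides, torch.randn(2, 512, 16, 32))
```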
In Figure 16, we show explicit feature fusion using the two-stream architecture proposed in GSCNN. We treat the SBD head as the Shape Stream: the final feature obtained from the Fuse Layer undergoes a 1 × 1 convolution, similar to how GSCNN uses the features from its Shape Stream.

Figure 3: Overview of the DFF architecture. The architecture adds an Adaptive Weight Learner on top of the CASENet architecture; it learns attentive weights, which are applied to the outputs of the Fuse Layer.

Figure 4: Overview of the DDS architecture. The DDS architecture is similar to CASENet, with the additional use of Side 4 features from the backbone and a deeper side layer called the Side Block. The Side Block consists of two basic ResBlocks followed by the Side Layer used in CASENet. Instead of only using the fifth Side Block as an auxiliary loss, the architecture uses deep supervision, where the four earlier Side Block outputs are supervised by binary boundaries as well.

Figure 5: Overview of the generalized CASENet architecture. This architecture is a "generalized" version of the CASENet architecture in Figure 2. The last (Nth) Side Layer is called the Semantic Side Layer, and its input feature is called the Semantic Side. The 1st to (N − 1)th Side Layers are called Binary Side Layers since the side feature each produces (called the Binary Side) has only a single channel (as in the other SBD architectures). By not constraining the number of sides and Side Layers, we can apply this SBD head to various backbones. The same generalization can be applied to the DFF and DDS architectures as well.
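The generalized head in the caption can be sketched as follows; the class name and channel sizes are illustrative, while the shared concatenation and the grouped 1 × 1 fuse layer follow the CASENet design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralizedSBDHead(nn.Module):
    """Sketch of the generalized CASENet head: N-1 Binary Side Layers
    (1-channel 1x1 convs), one K-channel Semantic Side Layer, and a grouped
    1x1 Fuse Layer over the shared concatenation of sides per class."""

    def __init__(self, side_in_channels, num_classes):
        super().__init__()
        *binary_in, semantic_in = side_in_channels
        self.binary_sides = nn.ModuleList(nn.Conv2d(c, 1, 1) for c in binary_in)
        self.semantic_side = nn.Conv2d(semantic_in, num_classes, 1)
        n_sides = len(side_in_channels)
        # One group per class: each class fuses its own semantic channel
        # with every binary side feature.
        self.fuse = nn.Conv2d(num_classes * n_sides, num_classes, 1,
                              groups=num_classes)
        self.num_classes = num_classes

    def forward(self, feats):
        size = feats[0].shape[-2:]  # upsample everything to the largest side
        sides = [F.interpolate(layer(f), size, mode="bilinear", align_corners=False)
                 for layer, f in zip(self.binary_sides, feats[:-1])]
        sem = F.interpolate(self.semantic_side(feats[-1]), size,
                            mode="bilinear", align_corners=False)
        # Shared concatenation: (sem_k, side_1, ..., side_{N-1}) per class k.
        fused = torch.cat([torch.cat([sem[:, k:k + 1]] + sides, dim=1)
                           for k in range(self.num_classes)], dim=1)
        return self.fuse(fused), sem

head = GeneralizedSBDHead(side_in_channels=(64, 256, 512, 2048), num_classes=19)
feats = [torch.randn(1, 64, 64, 64), torch.randn(1, 256, 32, 32),
         torch.randn(1, 512, 16, 16), torch.randn(1, 2048, 8, 8)]
fused, sem = head(feats)
```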

Figure 6: Diagram showcasing how the SBCB framework is applied to the DeepLabV3+ segmentation head.

Figure 7: Diagram showcasing how the SBCB framework is applied to the HRNet backbone with an FCN segmentation head.

Figure 8: Overview of the OTFGT module. We apply the signed distance function to segmentation masks to obtain category-specific distance maps. We then threshold the distances by the boundary radius to obtain category-specific boundaries. The boundaries are concatenated to form a semantic boundary tensor for supervision.

Figure 9: The two figures show sample validation images, masks, and boundaries from the Cityscapes validation split, which we rescale and crop to 512 × 512. In each figure, we compare the two preprocessing methods: the left uses preprocessed boundaries, and the right uses OTFGT boundaries. OTFGT boundaries have consistent boundary widths, while preprocessed boundary widths vary depending on the rescale value.

Figure 10: The three main datasets used in the experiments. We show a sample input image, segmentation GT, and the result of OTF semantic boundary generation for each dataset. The Cityscapes and BDD100K datasets are annotated by humans; the segmentation masks are clean but tend to have imperfections around the boundaries and exhibit "polygon" masks. The Synthia dataset, on the other hand, comes from a game engine, and the annotations are pixel-perfect, making it a challenging dataset for semantic segmentation. The segmentation masks for Synthia also contain instance segmentation, which is used in the OTF semantic boundary generation but not for the segmentation task. The BDD100K and Synthia datasets are less widely used than Cityscapes, but they contain more variation in natural noise and corruption (weather, heavy light reflections, etc.), which helps benchmark the methods fairly. The images are best seen in color and zoomed in.
The SBD heads are supervised with a binary cross entropy (BCE) loss for multi-label boundaries, L_SBD, following Yu et al. [2017]. While CASENet and DFF use only multi-label boundaries for supervision, DDS also introduces deep supervision of edges, where earlier side outputs are supervised with binary boundary maps using a BCE loss L_Bdry Liu et al. [2022a]. Generally, the total loss combines the segmentation loss with these boundary losses.
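Written out, the combined objective described above takes the following form; this is a reconstruction, and the weights λ are assumptions not stated in the text:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{Seg}}
  \;+\; \lambda_{\mathrm{SBD}}\,\mathcal{L}_{\mathrm{SBD}}
  \;+\; \lambda_{\mathrm{Bdry}} \sum_{s=1}^{N-1} \mathcal{L}_{\mathrm{Bdry}}^{(s)}
```

For CASENet and DFF, only the first two terms are used; the final sum over the earlier side outputs applies to DDS-style deep supervision.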

Figure 11: Overview of the task, along with predictions from baseline methods and from the model trained with the SBCB framework (CASENet head). (a), (b), and (c) are the input image, ground-truth (GT) segmentation map, and GT semantic boundary map. Note that because SBD is a pixel-wise multi-label classification task, the visualized semantic boundary maps have overlapping boundaries. In (d), we show the output of DeepLabV3+, a popular semantic segmentation model. The semantic boundary detection (SBD) baseline is CASENet, shown in (e). (f) and (g) show the outputs of DeepLabV3+ trained with the SBCB framework using the CASENet head. Small and thin objects are recognized better using the framework, and the boundaries are smoother with fewer artifacts.

Figure 12: Visualization of the backbone features and segmentation errors of DeepLabV3+ with and without the SBCB framework. From the left, each column represents the input image, last-stage features without SBCB, last-stage features with SBCB, segmentation errors without SBCB, and segmentation errors with SBCB. The features learned using the SBCB framework exhibit boundary-aware characteristics because the backbone is conditioned on semantic boundaries. Consequently, this results in better segmentation, especially around the mask boundaries. Best seen in color and zoomed in.
Different crop size. In semantic segmentation, the crop size is one of the most important hyperparameters, so we also test the SBD heads with 769 × 769, another popular crop size. The results are shown in Table 1b, where the general trend is the same as in Table 1a.

Figure 15: Diagram of applying Channel-Merge module for explicit feature fusion based on the SBCB framework.

Figure 16: Diagram of applying two-stream approach for explicit feature fusion in the SBCB framework.The architecture is modeled after the GSCNN architecture.

Figure 17: Visualization of the segmentation masks and segmentation errors for the models in Table 11a. Each column, from the left, represents the input image, prediction without SBCB, prediction with SBCB, ground truth, segmentation errors without SBCB, and segmentation errors with SBCB. We visualize two samples (two rows) per backbone. From the top row, the backbones are ResNet-50, DenseNet-169, ResNeSt-101, HR18, HR48, MobileNetV2, and MobileNetV3. Best seen in color and zoomed in.

Figure 18: Visualization of the segmentation masks and segmentation errors for the models in Table 12a. Each column, from the left, represents the input image, backbone feature without SBCB, backbone feature with SBCB, prediction without SBCB, prediction with SBCB, and ground truth. We visualize two samples (two rows) per segmentation head. From the top row, the segmentation heads are FCN, ANN, GCNet, DNLNet, CCNet, UperNet, and OCR. Best seen in color and zoomed in.

Table 1: Ablation studies comparing SBD heads as auxiliary signals. The number of parameters reported for SBCB is for the model used during training; for inference, the number of parameters equals the baseline. Unless explicitly stated, the training hyperparameters are the same throughout the experiments (crop size of 512 × 1024 and ResNet-101 backbone). (a) Results on the Cityscapes validation split. (b) Results on the Cityscapes validation split with crop size 769 × 769.

Table 2: Per-category IoU for the Cityscapes validation split. Columns: Method, SBCB, mIoU, road, swalk, build., wall, fence, pole, tlight, sign, veg, terrain, sky, person, rider, car, truck, bus, train, motor, bike.

Table 3: Results using the ResNet-101 backbone with different sides on the Cityscapes validation split.

Table 4: Ablation studies comparing different backbone conditioning methods on three popular segmentation heads: PSPNet, DeepLabV3, and DeepLabV3+. Note that all methods use ResNet-101 as the backbone. The number of parameters reported for SBCB is for the model used during training; for inference, the number of parameters equals the baseline.
(a) Results on the Cityscapes validation split.

Table 5: Results on the Cityscapes validation split. We compare the use of SegFix with auxiliary heads (SBCB and FCN) on three popular baseline models.

Table 6: Comparisons between DeepLabV3+ and GSCNN on the Cityscapes validation split. Note that the SBCB framework can be applied to train GSCNN.

Table 7: Configurations of the two common types of modifications to the ResNet backbone. Note that the output feature resolutions are listed in the order Stem, Stage 1, Stage 2, Stage 3, and Stage 4.

Table 10: Boundary F-scores on the Cityscapes validation split for the baseline models and for models trained with SBCB. The models are trained using the same hyperparameters with the ResNet-101 backbone.

Table 16: Results for BiSeNet and STDC on the Cityscapes validation split.

Table 17: Results of applying the SBCB framework and the "Backbone Trick" to modern backbones/architectures on the Cityscapes validation split. ConvNeXt is a backbone composed of pure ConvNet components with design elements borrowed from vision transformers (ViT) Dosovitskiy et al. [2021], Liu et al. [2022b]. SegFormer is a full segmentation architecture composed of a ViT-inspired backbone called the Mix Transformer (MiT) and a lightweight All-MLP segmentation head Xie et al. [2021]. Both architectures extract hierarchical features, which makes them compatible with the SBCB framework.

Table 18: Results using explicit feature fusion at the heads on the Cityscapes validation split.