BézierSeg: Parametric Shape Representation for Fast Object Segmentation in Medical Images

Background: Delineating the lesion area is an important task in image-based diagnosis. Pixel-wise classification is a popular approach to segmenting the region of interest. However, at fuzzy boundaries, such methods usually result in glitches, discontinuity or disconnection, inconsistent with the fact that lesions are solid and smooth. Methods: To overcome these problems and to provide an efficient, accurate, robust and concise solution that simplifies the whole segmentation pipeline in AI-assisted applications, we propose the BézierSeg model which outputs Bézier curves encompassing the region of interest. Results: Directly modeling the contour with analytic equations ensures that the segmentation is connected and continuous, and that the boundary is smooth. In addition, it offers sub-pixel accuracy. Without loss of precision, the Bézier contour can be resampled and overlaid with images of any resolution. Moreover, clinicians can conveniently adjust the curve’s control points to refine the result. Conclusions: Our experiments show that the proposed method runs in real time and achieves accuracy competitive with pixel-wise segmentation models.


Introduction
Image segmentation is a fundamental task in medical image processing. According to the statistics of the Grand Challenges (Grand Challenge, 2014) competitions in biomedical image analysis, there were 10, 14 and 13 tasks related to image segmentation in 2018, 2019 and 2020 respectively. Deep learning has achieved prodigious success in image processing, with unprecedented accuracy and generality (Wang, Han, Chen, Gao and Vasconcelos, 2019b; Ronneberger, Fischer and Brox, 2015; Chen and Merhof, 2018; Heker and Greenspan, 2020). As a result, most newly developed biomedical annotation tasks now employ deep learning segmentation models such as Unet++ (Zhou, Siddiquee, Tajbakhsh and Liang, 2018) and DeepLab v3+ (Chen, Zhu, Papandreou, Schroff and Adam, 2018). However, these pixel-based deep segmentation models may not fully satisfy the needs of medical applications, which have their own specialties: firstly, most targets are solid objects with continuous and confined boundaries, for instance, skin lesions. Secondly, biomedical annotation is normally not the final task but an intermediate step of the whole therapy process; further manipulations may be applied to the annotations for downstream tasks such as diagnosis, lesion measurement and radiotherapy. The output of a pixel-based algorithm may lack a clear contour, or may have burrs along the contour. Of course, one can post-process the result, for example by binarizing the segmentation heat map and passing it to OpenCV (Bradski, 2000) to find the contour. However, the output might then contain multiple contours, or a contour too complex for a single object.
The first case needs additional post-processing such as Non-Maximum Suppression (NMS) to select the most probable contour, and the second needs a contour-simplification algorithm to produce a contour that doctors can reasonably work with. These drawbacks show that pixel-based segmentation models might not be a straightforward solution for biomedical annotation. Moreover, all pixel-based segmentation models need upsampling operations that further slow down the whole process. In this paper, we propose a contour-based model, BezierSeg: an end-to-end segmentation model that needs no upsampling operations and directly outputs a clean Bézier contour. Bézier curves are widely used in commercial design software because of their user-friendly properties, so the predicted Bézier contour of the lesion area can be easily manipulated by doctors for further study.

Related Works
In deep learning, segmentation usually includes semantic segmentation and instance segmentation. Semantic segmentation gives pixel-wise classification results for the whole input image, while instance segmentation outputs bounding boxes for the detected objects together with pixel-wise segmentation within each bounding box.

Pixel-based Segmentation
Most pixel-based semantic segmentation models are fully convolutional networks (FCN), such as U-Net (Ronneberger et al., 2015), PSPNet (Zhao, Shi, Qi, Wang and Jia, 2017), BiSeNet (Yu, Wang, Peng, Gao, Yu and Sang, 2018) and DeepLab v3+ (Chen et al., 2018). U-Net has been widely used in biomedical segmentation problems. DeepLab v3+ is a cutting-edge semantic segmentation model developed by Google that employs the atrous spatial pyramid pooling (ASPP) layer to exploit multi-scale features. Two-stage methods perform instance segmentation by detecting bounding boxes and then segmenting at the pixel level within each box; a well-known example is Mask R-CNN (He, Gkioxari, Dollár and Girshick, 2017), which first detects objects and then uses a mask branch with RoI-Align to segment instances within the proposed boxes. To better exploit the spatial information inside the box, PANet (Wang, Liew, Zou, Zhou and Feng, 2019a) introduces bottom-up path augmentation and adaptive feature pooling, and fuses mask predictions from fully connected and convolutional layers. Such two-stage approaches achieve state-of-the-art performance. One-stage instance segmentation methods are free of region proposals: the model outputs pixel-wise auxiliary information, and a clustering algorithm then groups that information into object instances. Deep Watershed Transform (Bai and Urtasun, 2017) predicts the energy map of the whole image and uses the watershed transform algorithm to group object instances. YOLACT (Bolya, Zhou, Xiao and Lee, 2019) generates prototype masks and per-instance linear combination coefficients, then linearly combines the prototype masks with the corresponding coefficients to predict each instance-level mask.

Contour-based Segmentation
PolarMask (Xie, Sun, Song, Wang, Liu, Liang, Shen and Luo, 2020) uses a polar representation to model the object boundary. The model is trained to simultaneously regress the object centroid and the lengths of 36 rays emitted from the centroid at uniform angle intervals. Combined with the proposed Polar IoU Loss, the regression targets can be trained jointly, which improves the segmentation result. Instead of directly regressing the ray lengths, ESE-Seg (Xu, Wang, Qi and Lu, 2019) proposes to convert the polar coordinates of the object boundary into an explicit shape encoding via Chebyshev polynomial fitting, making the polynomial coefficients the regression target. Although polar coordinates are well suited to modelling roughly circular boundaries, both (Xie et al., 2020; Xu et al., 2019) model the radial distance as a single-valued function of the polar angle, which makes it impossible to represent cases where a ray intersects the object boundary at multiple points. Curve-GCN (Ling, Gao, Kar, Chen and Fidler, 2019) is designed to assist the manual annotation of class-agnostic objects; it parametrizes the object boundary with either polygons or splines, allowing annotation of both line-based and curved objects, and uses a graph convolutional network (GCN) to predict the polygon vertices or spline control points along the boundary in an iterative inference scheme. Deep Snake (Peng, Jiang, Pi, Li, Bao and Zhou, 2020) exploits the special topology of the object boundary as prior knowledge and adopts circular convolution to predict the vertices along the boundary. To obtain a precise object boundary, both Curve-GCN and Deep Snake have to run inference multiple times.

Parametric Representation
Parametric equations are commonly used to express the coordinates of the points that make up a geometric object such as a curve or surface. The general form of a parametric equation in two dimensions is shown in Eq. (1):

x = f(t), y = g(t), t ∈ [a, b], (1)

where t is the parameter and f(t), g(t) are any explicit functions of t. To recover the object shape, one samples t from the domain of the equation and obtains both x and y according to Eq. (1). Unlike approaches that model the object shape with a single-valued function (Xie et al., 2020; Xu et al., 2019), a parametric equation expresses the x and y coordinates of the shape independently, which allows it to mimic a multi-valued function. This property provides much more flexibility for shape representation than a single-valued function.
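As a minimal illustration of this flexibility, the plain-Python sketch below (with a hypothetical helper name, not code from the paper) samples a unit circle, a shape whose y coordinate is not a single-valued function of x yet has a trivial parametric form:

```python
import math

def sample_parametric(fx, fy, n=8):
    """Sample n points from a parametric curve x=fx(t), y=fy(t), t in [0, 1]."""
    ts = [i / (n - 1) for i in range(n)]
    return [(fx(t), fy(t)) for t in ts]

# A unit circle: for a given x there are two y values, so y cannot be
# written as a single-valued function of x.
circle = sample_parametric(lambda t: math.cos(2 * math.pi * t),
                           lambda t: math.sin(2 * math.pi * t))
```

Sampling t more densely recovers the shape to any desired resolution, which is exactly the sub-pixel property the paper exploits.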
A Bézier curve is a parametric curve that has been widely used in many fields as an efficient design tool. The explicit definition of a Bézier curve can be expressed as Eq. (2):

B(t) = Σ_{i=0}^{n} C(n, i) (1 − t)^{n−i} t^i P_i, t ∈ [0, 1], (2)

where C(n, i) is the binomial coefficient, P_i = (x_i, y_i) is the i-th control point, and n is the degree of the Bézier curve. One can construct a Bézier curve by following these steps:
1. Create the control polygon of the Bézier curve by connecting the consecutive control points.
2. Insert an intermediate point on each line segment at the ratio t : (1 − t).
3. Treat the intermediate points as the new control points, and repeat steps 1 and 2 until a single point remains.
4. As t varies from 0 to 1, the trajectory of that single point forms the curve.
Fig. 1 shows this process for constructing a cubic Bézier curve. Using Bézier curves reduces the number of parameters needed for shape encoding. As shown in Fig. 2, although the curve is determined by only four control points, it can still guarantee shape quality, since the precision of the curve representation depends only on how densely t is sampled from [0, 1], whereas a polygon representation requires many more vertices to achieve similar precision.
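The construction steps above can be sketched in a few lines of Python (`de_casteljau` is an illustrative helper name, not code from the paper):

```python
def de_casteljau(control_points, t):
    """Evaluate a Bézier curve at t by repeated linear interpolation:
    insert points at ratio t:(1-t) on each segment of the control polygon,
    treat them as new control points, and repeat until one point remains."""
    pts = list(control_points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

# Cubic example: four control points, sampled densely to trace the curve.
ctrl = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
curve = [de_casteljau(ctrl, i / 99) for i in range(100)]
```

Note that sampling 100 values of t yields 100 curve points from only four stored parameters, which is the parameter saving the text describes.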
Note that one could also model the object shape using Bézier curves in polar coordinates. To make manual refinement of the segmentation result easier, we instead model the object shape in Cartesian coordinates. We conducted an experiment to study the accuracy of curve regression with respect to the degree of the Bézier curve and, according to Table 1, chose degree 5 as a trade-off between usability and segmentation accuracy. To retain higher accuracy after converting the object boundary to a Bézier representation, we parametrize the object shape with piecewise Bézier curves. Specifically, we first split the whole object boundary at its four extreme points, i.e., the top, leftmost, bottom and rightmost points of the object, and then parametrize each part of the boundary with a Bézier curve. Fig. 3 (a) gives an example of our Bézier curve representation. We also compare our representation with the Chebyshev polynomial shape encoding proposed in (Xu et al., 2019). As Fig. 3 (b) shows, our Bézier curve representation avoids the oscillation at the ends of the object boundary exhibited by the Chebyshev polynomial shape encoding.
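One way to locate the four extreme points of a boundary (given as (x, y) pixels with y increasing downward) is sketched below; the tie-breaking choices mirror those described in the ground-truth generation section, and `extreme_points` is an illustrative helper, not the paper's code:

```python
def extreme_points(boundary):
    """Return the top, leftmost, bottom and rightmost points of a boundary.
    Ties are broken as in the paper: top-left among multiple top points,
    bottom-left for leftmost, bottom-right for bottom, top-right for rightmost."""
    top = min(boundary, key=lambda p: (p[1], p[0]))      # smallest y, then smallest x
    left = min(boundary, key=lambda p: (p[0], -p[1]))    # smallest x, then largest y
    bottom = max(boundary, key=lambda p: (p[1], p[0]))   # largest y, then largest x
    right = max(boundary, key=lambda p: (p[0], -p[1]))   # largest x, then smallest y
    return top, left, bottom, right
```

Splitting the boundary at these four points yields the four segments, each of which is then fitted with its own Bézier curve.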

Model Architecture
As shown in Fig. 4, we adopt ResNet-101 (He, Zhang, Ren and Sun, 2016) as the backbone of our model. We simply remove the softmax activation layer and change the number of output nodes in the last fully connected layer to match the regression targets. For example, if E extreme points and C control points are needed, the last fully connected layer should have (E + C) × 2 output nodes, without any activation layer. Note that the idea in this paper can easily be applied to mainstream object detection frameworks, providing an alternative for instance segmentation.

Ground Truth Label Generation
Given the mask of an object, we first extract the object boundary from the mask. Then we find the four extreme points of the object and use them to split the whole shape into four parts. If there are multiple extreme points, such as two or more topmost points, we use the top-left corner point for the top extreme point, and likewise the bottom-left, bottom-right and top-right points for the leftmost, bottom and rightmost extreme points respectively. We fit each part of the object boundary with a fifth-order Bézier curve. The first and last control points of the Bézier curve are fixed to the begin and end points of that piece of the boundary, which are also two consecutive extreme points. The four intermediate control points of the Bézier curve remain unknown. We convert Eq. (2) into the linear system shown in Eq. (3):

B P = Q̂, (3)

where B is the m × n matrix with entries b_{i,j}, P = (P_1, …, P_n)ᵀ stacks the control points, and Q̂ = ((x̂_1, ŷ_1), …, (x̂_m, ŷ_m))ᵀ stacks the boundary points,
in which m is the total number of points in that part of the object boundary, n is the number of control points of the Bézier curve, P_j = (x_j, y_j) is the j-th control point, (x̂_i, ŷ_i) is the i-th point of the boundary part, and b_{i,j} = C(n − 1, j − 1) t_i^{j−1} (1 − t_i)^{n−j}, t_i ∈ [0, 1]. Inspired by De Casteljau's algorithm for constructing a Bézier curve, we assume t_i = (i − 1)∕(m − 1) for the i-th point of the boundary part. Since the first and last rows of Eq. (3) are identities representing the extreme points, and the first and last columns multiply known values, we remove the first and last rows from Eq. (3) and subtract the known endpoint terms from both sides. To simplify the notation, let A denote the remaining matrix and Q̂′ the adjusted right-hand side. By computing the pseudo-inverse of A and multiplying it on both sides, we obtain the coordinates of the intermediate control points, P′ = A⁺ Q̂′. Finally, we concatenate the coordinates of the four extreme points and the 16 intermediate control points as our regression target (E = 4, C = 16), resulting in a 40-dimensional vector for each input image. Similar to ESE-Seg, we also conduct a sensitivity analysis for our Bézier curve representation. Specifically, we sample noise from N(0, σ) and add it to the control points and the extreme points to imitate the uncertainty of the convolutional neural network, and compare our Bézier representation with a polygon representation. To keep the complexity the same for both representations, we evenly select 20 points along the object boundary for the polygon. Fig. 5 shows the robustness of our Bézier curve representation. As shown in Fig.
5, the Bézier curve representation always achieves a higher Mean Intersection-Over-Union (MIOU) than the polygon representation with the same number of points. We conclude that the Bézier curve representation is superior to the polygon representation because it describes a curve more precisely. The Bézier curve representation is robust on the endoscopy images of upper gastrointestinal cancers (EIUGC) dataset and the International Skin Imaging Collaboration skin lesion challenge (ISIC) dataset, remaining above 0.6 MIOU even when noise with σ = 20 was injected. For the same perturbation, the MIOU on the nasopharyngeal carcinoma magnetic resonance imaging (NPCMRI) dataset dropped rapidly, mainly because its average ROI size is smaller than those of the other two datasets.
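The control-point fitting described above can be sketched with NumPy; uniform t_i and a fifth-degree curve are assumed, and `fit_bezier` is an illustrative helper name rather than the paper's implementation:

```python
import numpy as np
from math import comb

def fit_bezier(points, degree=5):
    """Fit a Bézier curve of the given degree to a boundary segment by least
    squares, keeping the first and last control points fixed at the segment
    endpoints (the extreme points), as described in the text."""
    pts = np.asarray(points, dtype=float)      # (m, 2) boundary points
    m, n = len(pts), degree + 1                # n control points
    t = np.arange(m) / (m - 1)                 # uniform t_i = (i-1)/(m-1)
    # Bernstein design matrix: B[i, j] = C(degree, j) t_i^j (1 - t_i)^(degree - j)
    B = np.stack([comb(degree, j) * t**j * (1 - t)**(degree - j)
                  for j in range(n)], axis=1)
    # Move the known endpoint contributions to the right-hand side, then solve
    # for the n-2 intermediate control points via the pseudo-inverse.
    rhs = pts - np.outer(B[:, 0], pts[0]) - np.outer(B[:, -1], pts[-1])
    inner = np.linalg.pinv(B[:, 1:-1]) @ rhs
    return np.vstack([pts[0], inner, pts[-1]])
```

If the boundary segment actually lies on a fifth-degree Bézier curve, this recovers its control points exactly; otherwise it gives the least-squares best fit.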

Bezier Differentiable Shape Decoder
Similar to FourierNet (Riaz, Benbarka and Zell, 2021), we devise a differentiable shape decoder, named the Bézier Differentiable Shape Decoder (BDSD). FourierNet uses an inverse fast Fourier transform as its shape decoder to convert the coefficients of a Fourier series into contour points, whereas we apply the parametric equation of the Bézier curve to map the control points to contour points. Following Eq. (2), we randomly sample values of t from [0, 1] and use the same t's to reconstruct both the ground truth boundary points and the corresponding predicted boundary points. In all experiments, we sample 72 points. Please note that the BDSD module is fully differentiable, allowing the gradient to back-propagate through the network and thus providing extra supervision during the training phase.
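The decoding step can be illustrated in plain Python; `decode_contour` is a hypothetical name, and the real BDSD module performs the same mapping with tensor operations so that gradients flow through it:

```python
from math import comb
import random

def decode_contour(ctrl, num_samples=72):
    """Map Bézier control points to contour points via Eq. (2).
    During training the same randomly sampled t's are used to decode both
    the ground truth and the predicted control points, so the two point
    sets correspond one-to-one."""
    ts = sorted(random.random() for _ in range(num_samples))
    n = len(ctrl) - 1  # degree
    points = []
    for t in ts:
        x = sum(comb(n, i) * (1 - t)**(n - i) * t**i * ctrl[i][0] for i in range(n + 1))
        y = sum(comb(n, i) * (1 - t)**(n - i) * t**i * ctrl[i][1] for i in range(n + 1))
        points.append((x, y))
    return points
```

Because every operation here is a polynomial in the control-point coordinates, the mapping is differentiable, which is what lets the matching loss supervise the network.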

Loss
The overall loss function for training our model contains two terms. The first, L_reg, is a Smooth L1 loss for control point and extreme point regression; the second, L_match, is another Smooth L1 loss for point matching, which measures the differences between the BDSD outputs of the prediction and the ground truth. Eq. (4) shows the overall loss function:

L = λ_reg L_reg + λ_match L_match, (4)

where λ_reg and λ_match are balancing hyper-parameters; we set both to 1 in all experiments.

Datasets
The International Skin Imaging Collaboration (ISIC) skin lesion challenge dataset of 2018 was released in the ISIC Challenge (International Skin Imaging Collaboration, 2018), which aimed at skin lesion analysis towards melanoma detection. One of its three tasks is lesion segmentation. The dataset includes 2,576 skin images and their corresponding masks, of which 2,060 are in the training set, 258 in the validation set and 258 in the test set. All ground truth data were reviewed and curated by practicing dermatologists with expertise in dermoscopy. An image may be associated with multiple mask annotations; in that case, we randomly select one of the expert annotations, falling back to a novice annotation if no expert annotation is provided. The three datasets used in this paper are summarized in Table 2.
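A minimal plain-Python sketch of the loss in Eq. (4), assuming flattened coordinate lists and the default weights of 1 (`smooth_l1` and `total_loss` are illustrative names, not the paper's code):

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 on flat coordinate lists: quadratic below beta, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def total_loss(pred_pts, gt_pts, pred_contour, gt_contour, w_reg=1.0, w_match=1.0):
    """Overall loss of Eq. (4): the point-regression term plus the BDSD
    matching term, weighted by the two balancing hyper-parameters."""
    return w_reg * smooth_l1(pred_pts, gt_pts) + w_match * smooth_l1(pred_contour, gt_contour)
```

In the actual model the two terms operate on the 40-dimensional regression vector and on the 72 decoded contour points respectively.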

Results
As shown in Table 3, the segmentation performance of BezierSeg 50 is slightly worse than that of BezierSeg 101, so we use ResNet-101 as the backbone of our model by default. Without any bells and whistles, BezierSeg achieves results competitive with pixel-based methods. On both the EIUGC and ISIC datasets, the MIOU difference between the two methods is less than 0.02. The average ROI sizes in EIUGC, NPCMRI and ISIC are 29,337 pixels, 14,152 pixels and 745 pixels respectively. BezierSeg outperforms DeepLab v3+ ResNet101 on NPCMRI by 21%, and trails DeepLab v3+ on EIUGC and ISIC by only 1.8% and 0.6% respectively. In terms of Hausdorff distance, BezierSeg outperforms DeepLab v3+ on both ISIC and NPCMRI. PolarMask performs relatively poorly on all three datasets. From Fig. 6 we can see that when the ROI size is greater than 20k pixels, DeepLab v3+ ResNet101 performs slightly better; when the ROI size is between 4k and 20k, the proposed model and DeepLab v3+ ResNet101 perform similarly; and when the ROI size is less than 4k, both BezierSeg and PolarMask outperform DeepLab v3+ ResNet101. Overall, BezierSeg is less sensitive to ROI size than DeepLab v3+ ResNet101. As shown in Fig. 7, BezierSeg always outputs a smooth contour, whereas DeepLab v3+ ResNet101 outputs rugged contours and often produces multiple separated contours. PolarMask gives polygonal outputs that are roughly smooth; however, it has difficulty handling dumbbell shapes and in most such cases outputs a pebble-like shape.
We also compare the FPS of BezierSeg, PolarMask and DeepLab v3+ on the same machine, equipped with one Tesla V100 graphics card. As shown in Table 4, BezierSeg, which requires no upsampling layers, has 42.6M parameters while DeepLab v3+ ResNet101 has 58.6M. With its simpler pipeline, BezierSeg doubles the FPS of DeepLab v3+ ResNet101. BezierSeg and PolarMask are both lightweight models with similar architectures and show no significant speed difference when running without post-processing; however, BezierSeg spends more time reconstructing the smooth contour during the post-processing stage. The speed of BezierSeg makes it well suited to real-time medical segmentation, and a smaller model is also easier to deploy on edge-computing devices.

Implementation Details
For training DeepLab v3+ ResNet101, we use the binary cross entropy loss function and initialize the model with weights pre-trained on COCO (Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick, 2014) train2017; for training BezierSeg, we initialize the backbone with weights pre-trained on ImageNet (Deng, Dong, Socher, Li, Li and Fei-Fei, 2009). For PolarMask, we use ResNet101 as the backbone and 36 rays for shape representation. During training on the EIUGC and ISIC datasets, we use data augmentation techniques such as random horizontal or vertical flipping with a probability of 0.5 and image rotation, and resize the images to (256, 256). For both training and inference, we normalize the input images by dividing their pixel values by 255.
We dynamically generate both the extreme points and the control points after performing data augmentation. For the ISIC dataset, we apply morphological transformations to the mask to remove spikes along the object boundary before fitting the Bézier curves. The initial learning rate is set to 1e-3 in all experiments, and we adopt an adaptive learning rate schedule during training: the learning rate is halved when the validation loss has stopped decreasing for 15 epochs, with the monitoring threshold set to 1e-2 for DeepLab v3+ ResNet101 and 0.5 for BezierSeg. We train the models for 100 epochs in all experiments, evaluate performance on the validation set at the end of each epoch, and select the model with the highest MIOU on the validation set for evaluation on the hold-out test set. The final results are obtained by running each experiment three times with different random seeds and averaging.

Conclusions
We propose the BezierSeg model, which directly outputs Bézier curves without post-processing. Its simple pipeline enables real-time inference, and our experiments show that BezierSeg achieves competitive accuracy on multiple medical datasets. However, the model fails to handle disconnected areas, such as the contours of dilated shapes, due to the limitations of the 2D Bézier curve representation. One possible remedy is to incorporate a parametric surface representation such as a level set. In future work we will explore Bézier surfaces for 3D object segmentation, a promising and practical solution for three-dimensional medical data such as computed tomography (CT) and magnetic resonance (MR) images.

Figure 1 :
Figure 1: The process of constructing a cubic Bézier curve. The trajectory of the red dot forms the curve as t varies from 0 to 1.

Figure 2 :
Figure 2: Bézier curve vs. polygon. The polygon representation requires many more vertices to describe the shape precisely.

Figure 3:
Figure 4:
Figure 3: (a) Our piecewise Bézier curve representation. (b) Our Bézier curve representation avoids the oscillation at the ends of the object boundary exhibited by the Chebyshev polynomial shape encoding.

Figure 5 :
Figure 5: Sensitivity analysis. The results show that the Bézier curve representation is more robust than the polygon representation under different noise levels.

Figure 7 :
Figure 7: Qualitative comparison of BezierSeg, DeepLab v3+ ResNet101 and PolarMask on three datasets. The first, second and last rows show results on the EIUGC, ISIC and NPCMRI datasets respectively. Green: ground-truth label; Red: DeepLab v3+ ResNet101; Blue: BezierSeg; Yellow: PolarMask.

Table 1
Degree of the Bézier curve. A higher degree brings higher accuracy, while an excessively high degree degrades performance.

Table 2
Dataset overview. MIOU and SIOU denote the mean and the standard deviation of the intersection-over-union between the Bézier curve representation and the ground truth mask respectively.