A New Regularization for Deep Learning-Based Segmentation of Images with Fine Structures and Low Contrast

Deep learning methods have achieved outstanding results in many image processing and computer vision tasks, such as image segmentation. However, they usually do not consider spatial dependencies among pixels/voxels in the image. To obtain better results, some methods have been proposed to apply classic spatial regularization, such as total variation, into deep learning models. However, for some challenging images, especially those with fine structures and low contrast, classical regularizations are not suitable. We derived a new regularization to improve the connectivity of segmentation results and make it applicable to deep learning. Our experimental results show that for both deep learning methods and unsupervised methods, the proposed method can improve performance by increasing connectivity and dealing with low contrast and, therefore, enhance segmentation results.


Introduction
Image segmentation is one of the most fundamental and important tasks in image processing and computer vision. Image segmentation aims to divide a digital image domain into several disjoint regions so that each region is homogeneous with respect to certain characteristics. For binary case it also refers to outlining the boundaries of objects of interest so that the target objects are easier to recognize. It has been applied and studied in many areas, such as biomedical imaging and geosensing.
Different methods of image segmentation have been developed. Variational methods are one of the popular ones that usually rely on optimizing some well-designed energy functional. Many classic variational models have been proposed. For example, Mumford-Shah model [1] does simultaneous image smoothing and segmentation and represents the smoothed image as a piecewise-smooth function. Potts model [2] and Chan-Vese model [3,4] adopt piecewise constant approximation. Level set and Heaviside functions are used in Chan-Vese model to represent different regions. Heaviside function is however not convex and causes optimization to become stuck in local minima. Some methods [5] apply convexity relaxation to guarantee a global solution of the models. The models usually seek optimal segmentation by minimizing some specific energy functional. The functionals usually contain a data fitting term that penalizes the error between the approximation and original image, and one or several regularization terms which represent the mathematical assumption or expectation of the underlying solutions [6][7][8].
With the development of a fully convolutional network (FCN) and improvement of computing power, deep learning-based methods have achieved remarkable performance on image segmentation. With the images as input, the trained neural network can classify each pixel of the input images by predicting the probabilities of each class. Recently, many new deep learning models and structures have been proposed. Examples are GoogleNet [9], UNet [10], Res-Net [11], Xception [12], and DeepLab models [13][14][15][16]. There are also some combinations, such as Res-UNet [17]. Compared with variational methods, deep learning methods can extract high-level features of images, as well as low-level features such as edges by feeding a set of training images into the neural network. That allows it to extract deeper meanings of pixels and deal with more complicated tasks, such as semantic image segmentation.
On the other hand, variational methods are still popular in many areas, such as medical image processing [6], due to some advantages. First, most variational methods [1,[3][4][5]18] are unsupervised or semi-supervised so that they are still available with little or without labeled data. Second, it is flexible to add many kinds of spatial regularization into the objective function [6,8], which helps variational methods obtain more reasonable results. Therefore, it is a good idea to apply spatial regularization into neural networks to combine the advantages of the two methods.
In fact, some methods adding regularizations into a neural network have been proposed. Some of them modify the loss functions by using objective functions of variational models [19] or directly adding spatial regularization [20,21] to the loss functions. Some methods make use of feature extraction and attention mechanism [22] to fuse region features and boundary features. In addition, some methods apply spatial regularization into activation functions by representing activation functions as the solutions of some variational problems [23]. There, the authors have discussed the advantages of these methods over other methods. First, by modifying the activation function, they can affect the process of back-propagation and, therefore, the learnable parameters, while the loss functions usually only affect the training stage. Second, they are flexible to be applied in any number of activation functions in any layers. In this paper, we propose a new regularity that could be used alone in non-supervised methods or adopted into activation functions. It is only used in the last layer for a better efficiency without losing much performance.
Total variation (TV), one of the most widely used spatial regularization during image segmentation, can minimize the boundary length and has the effect of removing noise. However, it is not suitable for images with low contrast, fine and multi-scale structures. Examples are blood vessels in medical images, branches of trees and antennae of insects. These images vary greatly in scale and shape. Low contrast makes it even worse as fine structures are often mistakenly treated as background and cause missing edges of fine structures. In these cases, a spatial regularization that could lengthen the boundary or enhance connectivity would help to catch boundaries of fine structures. Although some regularization methods have been proposed to improve connectivity by computing discrete curvatures for each piece of edges [24,25], they are used in variational models and are very difficult to be applied to deep learning. Moreover, it is inefficient to compute curvatures for each image and each training step in deep learning. In this paper, based on a soft thresholding dynamic (STD) regularization which can make boundary smooth if added to activation function [23], we design a new connectivity boosting regularization that is very easy to be adopted in deep learning framework.

Deep Learning Based Image Segmentation
Neural networks for image segmentation consist of a series of linear and non-linear operations. After being fed into a neural network, the image passes through the hidden layers in sequence. In a simple FCN, the input or feature map of each layer is the output of the previous layer. Suppose the output of the ith layer is o i (o 1 is the input image), then the output of the next layer would be: W i and b i represent linear operation and contain learnable parameters of the ith layer. A i , called activation function, is a non-linear function and usually remains unchanged. Activation functions add non-linearity to the neural network and enable the neural network to fit any non-linear functions. For image segmentation, the final output of the last layer has the same shape as the input image so that the neural network can predict the probability of each pixel belonging to each class.
In the process of training, the discrepancy between the output of training images and ground truth are computed according to a certain loss function. By applying backpropagation, the gradients of the loss function with respect to each parameter are obtained so that the neural network can 'know' how to update the parameters. Specifically, the parameters are updated with gradient descent-based algorithms to reduce the loss.

Spatial Regularization in Variational Methods
As mentioned, variational methods involve minimizing some energy functions that can be simply formulated as the sum of a fidelity term and a regularization term: The fidelity term F(u, o) measures the similarity between the original image o and the segmentation result u. For example, the Euclidean distance ||u − o|| 2 2 is used in many variational models. The regularization term R(u) is designed to have some certain influence on the result. Many kinds of regularity terms, such as total variation, Tikhonov, total generalized variation [26] and total fractional variation [27] have been proposed. In particular, TV regularization can be expressed as the sum of the norm of gradient: |∇u|.
In an image, non-zero gradient usually means appearance of boundary. Therefore, TV regularization can approximate the boundary length and the segmentation result would be more robust to noise and preserve edges better.
Although TV regularization performs well in segmentation tasks, its non-smoothness property makes it not efficient in deep learning. Typically, the variational problems with TV regularization can only be solved with some methods that are too complicated for deep learning, such as alternating direction method of multipliers (ADMM) and primal dual methods. Compared with gradient-based methods, such as gradient descent, these methods usually converge much slower and each of their steps is computationally expensive. The computation of total variation itself is time consuming too. Since the training process of deep learning can take hundreds of epochs, the regularization used in deep learning should be able to be solved efficiently.

Soft Threshold Dynamic (STD) Regularization
To combine deep learning framework and spatial regularizations, Liu et al. [23,28,29] have given variational explanations for some widely used activation functions in deep learning including softmax, ReLU, and sigmoid [23,28,29]. For example, softmax operator can be written as softmax and it can be represented as the solution of the following optimization problem [23] when the parameter ε is 1: where Ω is the domain of the whole image, C is the number of classes, o is the feature map input, and u is the output of the softmax operator. Although problem (2) has no regularization about u, one can easily add regularization terms used in image processing and computer vision. In [23], soft threshold dynamics (STD) [30] was adopted: where * means convolution, k represents a discrete Gaussian kernel and λ is a parameter. The STD regularization term u, k * (1 − u) comes from convolution generated motion [31,32] whose goal is to simulate the motion of interfaces of some dynamic systems, such as chemical and biological systems. When simulating the dynamics of interface under surface tensions [30], it assumes that the interfacial energy is proportional to the length of interface. Thus, STD term estimates the boundary length. Compared with TV regularization, STD regularization has similar effect. However, it is smooth and, thus, problem (3) can be solved with much more efficient algorithms. Furthermore, STD regularization involves convolution and inner product operation, which are cheaper to calculate than norm.
Theoretically, STD regularization can be added to arbitrary numbers of activation functions in any layers in a neural network, but in this paper, the proposed regularization to the softmax function is just applied in the last layer for some reasons. First, solving the optimization problems for many layers is time consuming because most of them can only be solved with iterative methods. Second, the numerous feature maps in the hidden layers represent certain features of the image. Applying regularization in these layers may have unknown influence on the extraction of the features. In practice, the parameters ε, λ, and σ can be set as learnable so that they can be tuned automatically by neural network.

Explanation of STD Regularization
The STD regularization term u, k * (1 − u) is an approximation of boundary length while used in image segmentation model. By computing the convolution of the image with a Gaussian kernel, a series of weights related to curvature can be generated along the boundaries of different regions. The sum of these weights along the boundary can be a good approximation of the boundary length. To help understand STD regularization and illustrate the effect of Gaussian kernel, an example of binary u with values 0 and 1 is shown in Figure 1. Only pixels along the boundary of u have non-zero values in u k * (1 − u). Additionally, the inner product u, k * (1 − u) is the summation of u k * (1 − u). The pixels along the corners have higher values, which means u, k * (1 − u) has larger weight when the curvature is higher. Therefore, the regularization is an approximation of boundary length weighted by curvature. Note that the Gaussian kernel in STD makes the regularization penalize arc length in all directions. In certain instances, however, we need to enhance connectivity which leads to longer arc length along certain directions. For instance, when the dominant edges of an image are horizontal, we should allow long arc length along the horizontal direction (see Section 4.2 for one example). Therefore, it is necessary to design a new regularization that can enhance the connectivity in a selective way.

New Regularization Term
To enhance connectivity, a new regularization term is designed to expand the foreground. Naturally, suppose the values of u on the foreground are close to one, the regularization term can be formulated as Figure 2 shows the effect of the regularization term (4) with Gaussian kernel. The foreground value becomes −1 and the boundary values are also negative. So the inner product should be the minus sum of weighted boundary length and foreground area. The regularization (4) tends to expand the area of the foreground in a rate weighted by the boundary curvature. The foreground will not expand without limit due to the existence of loss function and tuned parameters. In practice, it would be better to enhance connectivity and elongate boundary length in a certain direction, and it is not difficult to design such kernels. For example, a new 3 × 3 kernel can be defined as shown in (5). In Figure 3, u k 1 * (−u) has larger weights on horizontal boundary, therefore when expanding the foreground it has higher priority to elongate vertical boundary. Similarly, we can design such kernels in three other directions: horizontal, 45 degree, and 135 degree diagonals.

Proposed Model
To better handle images with low contrast multi-scale features that could be disconnected by existing segmentation method, the new regularity is added to the model (3) to obtain the model (6). The parameters λ i and λ are balancing weights. In deep learning framework, they can be set as learnable parameters so that the weights can be adjusted continuously. Notice that we remain the STD regularization term (4th term) to address noise. One can choose to remove or retain STD for different problems. Generally, it should be retained if the noise in background is strong.
We also note that all the four kernels in the model are used. Our deep learning framework can, however, automatically update weights λ i and they will be positive and different, with the largest one corresponding to the the dominant direction of connectivity to enhance (see one example and the explanation in Section 4.2).
The objective function is the sum of a series of smooth functions. The kernels k 1 , k 2 , and k 3 are semi-positive definite while k 4 is indefinite. In most cases, the weighted sum of kernels is semi-positive definite, and the regularization term is concave. Then the problem (6) can be solved with the iterative method proposed by [23]: where p t = ∑ 4 i=1 λ i k i * (−2u t ) + λk * (1 − 2u t ) ∈ ∂R(u t ) and ∂R(u t ) is the subgradient of the sum of the 3rd and 4th term in the model (6). The solution of each step is: When λ 4 is too large the weighted sum of kernels may not be strictly semi-positive definite. The general model can be solved with Proximal Forward-Backward Scheme (PFB) [33,34] efficiently. It converges quickly for a wide range of non-convex problems whose objective function is the sum of smooth and non-smooth functions without assuming convexity. The iterative algorithm with PFB in the t-th step is: where τ is a constant to be tuned. The closed form solution can be represented with Lambert W function W: where I is a tensor of the same size as u with all the entries equal to one. In practice, the value of Lambert W function can be well approximated by Winitzki's approximation [35].
In the final step, to make sure the constraint is satisfied, one more softmax operation on u is performed. Both of the methods will converge within about 10 steps.

Results
For the purpose of testing whether the proposed model could enhance connectivity in image segmentation with low contrast, we focus on crack images and retinal vessel images for their difficulty in obtaining connected fine structures under the condition of low contrast. First we use the Crack Forest Dataset [36] including 118 forest images of size 448 × 448. In total, 18 images are randomly selected as the test set. The tested dataset of retina vessel segmentation is DRIVE [37], which is one of the most frequently used retina vessel datasets. It consists of 40 color images of size 565 × 584, 20 of which are used as the training set. Both two datasets vary greatly in scale and shape with low contrast. In addition, two images are used to test unsupervised image segmentation, respectively. STD regularization is used for the Crack Forest Dataset.
The deep learning model is based on U-Net whose convolution layers are replaced by depthwise separable convolution layers with residual connections used in Xception [12] architectures. Such a structure can largely improve the performance of the original U-Net. Preprocessing includes random flip, randomcrop, and contrast limited adaptive histogram equalization (CLAHE) [38]. Both random flip and randomcrop are popular data augmentation techniques by artificially expanding the size of dataset. They can artificially expand the size of dataset and avoid overfitting [39], which is especially helpful for training dataset of small size. CLAHE is a powerful method of image enhancement and has been proved to be able to improve the quality of retina vessel and crack images [40,41]. Since the preprocessing, initial parameters and shuffle of mini batch in deep learning can produce some randomness, for each dataset, we calculate the average values and standard deviations of five computations.

Evaluation Metrics
The results are compared and evaluated with some commonly used metrics including accuracy (Acc), precision (Pre), sensitivity (Sen), specificity (Spe), F1 score (F1), and AUC, i.e., the area under the receiving operator characteristic (ROC) curve. In binary classification they are defined as: where TP is true positive, i.e., the number of truly classified positive pixels representing the foreground or the vessels. Similarly, TN, FP, and FN are true negative, false positive, and false negative. AUC is the area under the curve created by 1 − Spe in the x axis and Sen in the y axis. Sensitivity is the accuracy rate for the foreground. Specificity is the accuracy for the background. A higher sensitivity means better connectivity for the vessels while the effect of specificity is inverse. Therefore, the main effect of my model is increasing the sensitivity at the cost of slightly decreasing specificity.

Results and Discussion
First, the development of λ i 's is tested with 30 selected crack images. These images are selected from Crack Forest Dataset in which the cracks are all in horizontal direction ( Figure 4). As shown in Figure 5, after 50 epochs, the value of λ 2 becomes the smallest so that the weight of k 2 is the smallest. Note that k 2 corresponds to the horizontal kernel, and, therefore, the trained network tends to enhance the connectivity in horizontal direction. This example shows that the neural network can automatically decide the dominant direction of the boundary.

Crack Forest Dataset
First, the full Crack Forest Dataset is tested. The performance with or without the proposed regularization is compared in Table 1. The proposed regularization improves the performance of the first four metrics except AUC. Specifically, as we expected more pixels are classified as foreground, therefore the regularization can obviously increase sensitivity and thus improve connectivity. An example shown in Figure 6 more explicitly illustrates the improvement. In the bottom right, while preserving the main part of the crack like the bottom left, the proposed regularization helps recognize more details of small branches and connect some disconnected parts.

Retina Vessel
We test the performance of our model on DRIVE dataset. The comparison of results with and without the proposed regularization are shown in Table 2. As expected, the sensitivity is increased at a cost of slightly reduced specificity. While the accuracy and F1 score almost remain unchanged, the AUC is largely improved. An example is shown in Figure 7, and a zoomed region indicated by white boxes is shown too. Some missing branches are elongated, and some disconnected parts are connected. In Table 3, our results are compared with some state-of-the-art methods. Our accuracy and specificity are the highest among them and our sensitivity and AUC are also comparable to them. F1 score is not included because few of the papers show it. Even better performance are expected if we apply attention mechanism or more pre-and post-processing techniques.   Table 2. From left to right: ground truth, results with original softmax, and results with proposed regularization.

Unsupervised Model
To show the effectiveness of the proposed regularization term, we add it to the energy function in the variational model [5] that only involves total variation. The optimization problem can be formulated as: where f 1 = ( f − c 1 ) 2 and f 2 = ( f − c 2 ) 2 , f represents the image and c 1 and c 2 are the means of the foreground and background indicated by u. Because in unsupervised method the parameter λ's are fixed, we just apply Gaussian kernel in the regularization added with TV. Since the added regularization term is smooth, the modified model can be solved with commonly used methods, such as ADMM. The proposed model (16) is compared with the one in [5] and some results are shown in Figures 8 and 9. Figure 8 is a typical example of images with multi-scale features, the legs of the insect are long and slim, and some parts of the insect mix with the background due to low contrast. In variational models these parts with low contrast are naturally recognized as background. However, in the third figure of Figure 8, the legs are connected as a whole and some noise on the body caused by the texture pattern are removed. Another example is shown in Figure 9, which is human arteries. The image noise is strong and the contrast between the vessels and the background is very low. As shown in the second column of Figure 9, if the effect of TV is strong, the model will eliminate some fine structures together with noise. However, in the third column, the model retains some vessels, and we can find the corresponding parts from the original image in the first column. The parameters λ 1 and λ 2 in the Equation (16) should be fine-tuned. Typically, λ 1 is between 0.01 to 1. λ 2 should be as small as 10 −6 . Both results show the potential of the proposed regularization in enhancing segmentation of images with multi-scale structures and low contrast.

Conclusions
A novel spatial regularization for image segmentation is proposed, which can be applied flexibly in neural networks. It is designed to enhance segmentation of fine structures especially in images with low contrasts through improving connectivity. During process of training the neural network can learn to find the dominant direction of the boundary. Our results show that the proposed regularization can improve the performance of neural network on some suitable datasets. Specifically, we test the retina vessel and forest crack datasets and achieve better results compared with some recently proposed models. We observe obvious improvements of sensitivity corresponding to better connectivity. In addition, the effect of the proposed regularization applied in an unsupervised model is tested and we find improvement. In the future, we will focus on improving connectivity locally. Additionally, we will continue to design other specific spatial regularizations based on similar mechanism.

Conflicts of Interest:
The authors declare no conflict of interest.