Two-Stage Framework for Faster Semantic Segmentation

Semantic segmentation consists of classifying each pixel of an image according to a set of classes. Conventional models spend as much effort classifying easy-to-segment pixels as they do classifying hard-to-segment pixels, which is inefficient, especially when deploying under computational constraints. In this work, we propose a framework wherein a model first produces a rough segmentation of the image, and patches of the image estimated as hard to segment are then refined. The framework is evaluated on four datasets (autonomous driving and biomedical) across four state-of-the-art architectures. Our method accelerates inference by a factor of four, with additional gains in training time, at the cost of some output quality.


Introduction
Neural networks for segmentation typically produce, for each pixel, a probability of belonging to the region of interest [1][2][3][4]. The same number of arithmetic operations is performed for all pixels, which seems computationally inefficient because some regions of the image may be harder to segment than others. Intuitively, that is not how humans would produce a manual segmentation: we would create a rough draft and then refine the parts that require more detail. Furthermore, many segmentation applications are imbalanced: the background predominates, and it is often easier to segment.
In applications with very high-resolution images, such as high-resolution digital microscopes, a common practice is to split the image into patches and process each patch separately [5]. However, such an approach does not address the fact that resources are spent evenly when it would make sense to deploy them unevenly across the regions of the image. Our proposal tries to be more selective about the patches it chooses. Other areas that may also benefit from low-cost segmentation, even with some accuracy penalty, include single-board computers, such as the Raspberry Pi, which are used in smart homes, security, or retail to estimate the number of customers entering a store [6].
As illustrated by Figure 1, the proposal is a sequential segmentation method whereby Model 1 segments a lower-resolution version of the image. Based on the probability scores from this first model, the harder-to-segment regions are identified. These regions are then fed into Model 2 for further processing. Finally, the outputs of both models are combined.
Iterative segmentation methods already exist [7][8][9][10], but their focus is on improving the quality of the segmentation, not the speed. Our proposal is applicable to any type of high-resolution image, but it is tested on two types of images (biomedical and autonomous driving).
The paper expands on a previous conference paper [11] by introducing three crucial improvements to the pipeline and expanding the experiments: (i) the second model is trained from the first model; (ii) the first model is connected to the second model to provide context; (iii) three sampling strategies are considered to choose the image patches for training of the second model.
Besides this Introduction, the paper is organized as follows: Section 2 explores the related work; Section 3 explains the proposed algorithm; Section 4 details the experiments performed and discusses the results; finally, Section 5 concludes the paper. The source code with the implementation used to produce the experiments of this paper is publicly available at https://github.com/rpmcruz/faster-segmentation.


Related Work
In broad strokes, deep learning architectures for semantic segmentation consist of two sequential blocks, an encoder and a decoder, as shown in Figure 2. The encoding phase reduces the image into a smaller, more compact, higher-level representation, while the decoder projects that latent representation into a segmentation with the same size as the original image. Four major architectures are considered in this work: fully convolutional network (FCN) [1], SegNet [2], U-Net [3], and DeepLab [4]. FCN is considered a pioneering approach to deep-based image segmentation. It uses successive convolutions for the encoder, but a single dense layer for the decoder that is then reshaped to the final shape [1]. SegNet also uses successive convolutions for the decoder [2]. U-Net introduces extra "skip connections", which consist of concatenating the activation map produced by each encoder layer to the corresponding decoder layer; this improves gradient flow and, typically, the output quality [3]. DeepLab mitigates the checkerboard effect, sometimes found in segmentation methods, by diluting the distinction between encoder and decoder through atrous convolutions, which avoid the need to successively reduce the input image by instead enlarging the convolution kernels [4].
As shown in Table 1, as the input doubles, the number of floating point operations (FLOPs) quadruples. The memory required also increases. This makes applying deep-based models to high-resolution images computationally expensive. In areas such as computational pathology, one of the main limitations is the large file size due to the high resolution (and multiple magnifications) of whole-slide images. Patch-based methods are therefore common. In these approaches, images are divided into several smaller patches that are small enough for neural networks to process [5].
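The quadratic growth can be verified with a back-of-envelope count: for a stride-1 convolution, the number of multiply-accumulate operations is proportional to the output height times width, so doubling both spatial sides multiplies the count by four. A minimal sketch (the layer sizes below are illustrative, not taken from Table 1):

```python
def conv2d_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulate count of one stride-1, 'same'-padding
    convolution: one k*k*c_in dot product per output element."""
    return h * w * c_out * k * k * c_in

base = conv2d_macs(256, 256, 3, 64)      # e.g., a 256x256 RGB input
doubled = conv2d_macs(512, 512, 3, 64)   # doubling each spatial side
assert doubled == 4 * base               # operations quadruple
```

The same proportionality holds for each convolutional layer in the network, which is why halving the input resolution (or processing quarter-size patches) yields roughly a fourfold FLOP reduction per forward pass.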
There is existing work focused on improving training and/or inference time, typically by working with multiple scales, though not for semantic segmentation. Google AI performs alpha matting on mobile devices (i.e., extracting a foreground object) with a two-stage process, due to computational constraints, whereby a neural network performs an initial step and a secondary network is used only on areas that might require further work [12]. Such an approach has been adapted to depth estimation [13]. For image classification, ref. [14] uses multiple inputs of varying scales, which are decided by attention mechanisms. Another approach, used for classification and object detection, is to choose these patches recurrently through reinforcement learning [15,16].

Method
Conventional methods apply computational effort uniformly across the input space, which seems inefficient. Therefore, our proposal consists of the following steps, which are made clearer by the pseudocode in Algorithm 1:
Step 1. Model 1 segments a low-resolution version of the image (line 1);
Step 2. the poorly segmented image patches are identified based on the probabilities produced by Model 1 (line 2);
Step 3. Model 2 refines these patches (line 3).

Algorithm 1: Pseudocode of the proposed method.
Input: two models, f(1) and f(2), and an image, where s↑ and s↓ are upscale and downscale interpolations, c(i,j) crops the (i, j) patch, g is the selection function (Section 3.1), and h produces an uncertainty score for a patch by averaging the uncertainty associated with the probability of each pixel, u(p) = −p log2 p, so that highly uncertain regions correspond to those with probabilities closest to 0.5.
Notice that both models use the same architecture and receive the same input size, except that Model 1 receives a scaled-down version of the image, while Model 2 receives a crop of the image. For example, if Model 1 receives the image downscaled by 4×, then Model 2 receives patches from a 4 × 4 grid of the original, so that the input shape of both models is the same. The full pipeline is illustrated in Figure 3.
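Under these conventions, the inference stage of Algorithm 1 can be sketched in a few lines. The sketch below is a simplified, framework-free illustration in NumPy, not the paper's implementation: `model1` and `model2` stand in for the two trained networks (any callables returning a probability map of the same spatial size), downscaling is plain subsampling rather than a learned interpolation, and the per-pixel uncertainty follows the definition u(p) = −p log2 p given for h.

```python
import numpy as np

def uncertainty(p, eps=1e-12):
    # per-pixel uncertainty u(p) = -p*log2(p), highest near mid probabilities
    return -p * np.log2(p + eps)

def two_stage_segment(image, model1, model2, scale=4, threshold=0.5):
    """Sketch of the proposed inference: a rough low-resolution pass,
    then refinement of only the patches Model 1 is least certain about."""
    h, w = image.shape[:2]
    ph, pw = h // scale, w // scale
    small = image[::scale, ::scale]              # s_down: cheap downscale
    probs = model1(small)                        # Step 1: rough segmentation
    out = np.repeat(np.repeat(probs, scale, axis=0),
                    scale, axis=1)               # s_up: back to full size
    for i in range(scale):                       # scale x scale patch grid
        for j in range(scale):
            ys = slice(i * ph, (i + 1) * ph)
            xs = slice(j * pw, (j + 1) * pw)
            if uncertainty(out[ys, xs]).mean() > threshold:  # Step 2: select
                out[ys, xs] = model2(image[ys, xs])          # Step 3: refine
    return out
```

Note that, as stated above, both models see inputs of the same shape: `small` and each `image[ys, xs]` crop are both (h/scale) × (w/scale).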

Selection Method
At each epoch, while the model is being trained, one patch of each image is sampled by g using one of the following strategies.

2. Weighted: sampling is weighted by the uncertainty produced by Model 1. Shannon entropy is used as the measure of uncertainty: the probability map p produced by Model 1 is converted into an uncertainty score h, which is then normalized and used as the sampling probability.

3. Highest: the patch with the highest uncertainty is always selected. While this seems the most obvious approach, it also removes some stochasticity and variability from the training of Model 2.

The three strategies are compared experimentally in Section 4.3. After the model has been trained, the patches with uncertainty above a given threshold (typically 0.5) are selected for refinement.
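The uncertainty-driven strategies above can be sketched as follows. This is an illustrative stand-in for the selection function g (the function name and signature are ours, not the paper's), assuming the per-patch uncertainty scores have already been computed by h:

```python
import numpy as np

def pick_patch(patch_uncertainty, strategy="weighted", rng=None):
    """Choose one patch index per image for training Model 2.
    `patch_uncertainty` holds one averaged uncertainty score per patch."""
    u = np.asarray(patch_uncertainty, dtype=float)
    if strategy == "highest":          # deterministic: most uncertain patch
        return int(np.argmax(u))
    if strategy == "weighted":         # sample with probability ~ uncertainty
        rng = rng or np.random.default_rng()
        return int(rng.choice(len(u), p=u / u.sum()))
    raise ValueError(f"unknown strategy: {strategy}")
```

The weighted variant keeps some stochasticity (every patch retains a nonzero chance of being picked), whereas the highest variant always trains Model 2 on the single most uncertain patch.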


Extension
This paper expands our conference paper [11], introducing three crucial improvements to the pipeline and extending the experiments: (1) transfer learning is used so that Model 2 is trained on top of Model 1 (black dashed lines in Figure 3); (2) Model 1 provides context to Model 2 through elementwise addition between the penultimate layers of both models (blue dashed lines); (3) during training for step 2, one patch of each image is selected for Model 2 using one of three sampling strategies.
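As a rough illustration of improvements (1) and (2), with model weights and activations represented as plain arrays (the function names below are placeholders for exposition, not the actual implementation):

```python
import numpy as np

def init_model2_from_model1(model1_weights):
    """(1) Transfer learning: Model 2 starts from a copy of Model 1's
    trained weights instead of a random initialization."""
    return {name: w.copy() for name, w in model1_weights.items()}

def add_context(model2_features, model1_features_crop):
    """(2) Context: Model 1's penultimate activation map, cropped to the
    current patch, is added elementwise to Model 2's penultimate layer."""
    assert model2_features.shape == model1_features_crop.shape
    return model2_features + model1_features_crop
```

Because the addition is elementwise, the cropped region of Model 1's activation map must match the spatial shape of Model 2's penultimate layer, which holds here since both models share the same architecture and input size.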

Datasets
Four datasets were used: two from the autonomous driving literature (BDD [18] and KITTI [19]) and two from the biomedical literature (BOWL [20] and PH2 [21]), all of which are detailed in Table 2. The biomedical datasets consist of binary segmentation (BOWL for cell recognition and PH2 for skin lesion recognition), and the autonomous driving datasets were also used for binary segmentation (recognizing vehicles). The train-test split was 70-30, except for BDD, which already comes partitioned by the authors.

Experimental Setup
The four semantic segmentation models from the literature discussed in Section 2 are experimented with: FCN [1], SegNet [2], U-Net [3], and DeepLab v3 [4]. All models share the same backbone (ResNet-50 [17]), pretrained on ImageNet [22]. The Torchvision [23] implementations are used for FCN and DeepLab v3, while SegNet and U-Net are implemented by us, also making use of the ResNet-50 backbone from Torchvision.
As typically done in segmentation tasks, focal loss is used for training [24], while the evaluation metric is the Dice coefficient, the segmentation equivalent of the F1-score used in classification, which takes class imbalance into account. The optimizer is Adam with a learning rate of 10^-4, and each model is trained for 20 epochs.
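For reference, both quantities can be written compactly for the binary case. This is a generic NumPy sketch, not the exact implementation used in the experiments:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss [24]: cross-entropy weighted by (1 - p_t)^gamma, which
    down-weights easy, already well-classified pixels."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)    # probability of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A n B| / (|A| + |B|): the segmentation analogue of the
    F1-score, insensitive to the (large) true-negative background."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

The focal term is what lets training spend less of the gradient signal on easy background pixels, which aligns with the paper's motivation of focusing effort on hard-to-segment regions.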
For data augmentation, the transformations are horizontal flipping, jittering of brightness and contrast by 0.1, and a random translation of 10% of the image size, performed by an upscale followed by a crop. All images are resized to 768 × 768.

Results
The main results are presented in Table 3. The four architectures are contrasted and evaluated on the four datasets. The baseline (a single model applied to the entire image) is contrasted against the pipeline (using 16 patches, weighted sampling, and an uncertainty threshold of 0.25). The quality of the output is measured by the Dice metric, while latency is reported for both training (on GPU) and inference (on CPU).
The proposal presents considerable latency gains: on average, training time is reduced by 42.0% and inference time by 74.7%, at the cost of reducing the quality of the produced segmentation maps by 6.3%, on average.
The italicized values in the table will now be subject to four ablation studies.

Ablation Studies
In Table 4, we estimate the effect of changing several aspects of the pipeline. The architecture is fixed as DeepLab, and the other parameters are varied. Unsurprisingly, some columns predominate (almost fully bold), for reasons discussed below.

1. The number of patches used to divide the image: there is a general reduction of quality as the number of patches increases. However, this comes with a gain in latency because the more patches used, the smaller the input sizes, which means fewer FLOPs, as previously detailed in Table 1.

2. The sampling strategy used to select the patches during the training of Model 2: the patches are selected based on the uncertainty produced by Model 1. The differences are not considerable, albeit always choosing the highest-uncertainty patch or sampling weighted by the normalized uncertainty seem like the best strategies.

3. The impact of changing the uncertainty threshold with which patches are selected for Model 2 during inference: the threshold chooses the patches from Model 1 to be refined by Model 2. The lower the uncertainty threshold, the more patches are selected. Clearly, the more patches that are refined by Model 2, the better the final segmentation, naturally at a proportional time cost.

4. Whether certain additional features from the proposal are relevant: the dashed lines illustrated in the pipeline from Figure 3 are disabled, depending on whether Model 1 is used to pretrain Model 2 (fine-tuning) or whether an activation map is given from Model 1 to Model 2 (context). Both of these aspects of the pipeline clearly help improve the output quality, since disabling them lowers it.

Conclusions
It seems wasteful for a semantic segmentation model to perform the same computational effort across the entire pixel space of an image. The current work proposes a two-stage pipeline whereby a rough segmentation is produced by a first model using a low-resolution image, followed by another model that refines the most intricate regions of the image using patches of the original high-resolution image. These regions are selected based on uncertainty estimates produced by the first model. In the end, both outputs are combined to form the final segmentation.
The approach is validated on four datasets and four architectures. While the proposal reduces output quality by 6%, it reduces training time by 42% and inference time by 75%. The proposal may thus offer a path to considerable efficiency gains, although work is still required to bridge the gap in output quality. The large efficiency gains may justify the accuracy penalty for certain applications, such as resource-limited devices running noncritical applications.
In this work, the image is divided into uniformly distributed contiguous patches. In future work, it would be interesting to allow more flexibility in how the image is divided into patches to avoid cases in which an object is cut in half between patches. Furthermore, training both models from end to end would have been desirable, possibly with an attention mechanism.