Incremental and Multi-Task Learning Strategies for Coarse-To-Fine Semantic Segmentation

Abstract: The semantic understanding of a scene is a key problem in the computer vision field. In this work, we address the multi-level semantic segmentation task where a deep neural network is first trained to recognize an initial, coarse set of a few classes. Then, in an incremental-like approach, it is adapted to segment and label new object categories hierarchically derived by subdividing the classes of the initial set. We propose a set of strategies where the output of coarse classifiers is fed to the architectures performing the finer classification. Furthermore, we investigate the possibility of predicting the different levels of semantic understanding together, which also helps achieve higher accuracy. Experimental results on the New York University Depth v2 (NYUDv2) dataset show promising insights on multi-level scene understanding.


Introduction
The semantic understanding of a scene is a long-standing problem in the computer vision field that can be approached at different levels of interpretation. For instance, in image classification, a single label describing the main object in each image is returned as output. In object detection, instances of particular classes are identified by means of a bounding box that surrounds each object, and a label is assigned to each instance. Semantic segmentation, instead, is a dense labeling task in which a label has to be associated with every pixel of the image. Similarly, we could interpret the scene at different levels of precision: in some scenarios, for example, it may be enough to identify just a few classes, while in others a more fine-grained prediction could be required. Moreover, in other settings, a coarse set of classes could be predicted first and then the set of classes could hierarchically grow into more refined categories to better understand the semantic context. To visualize this scenario, imagine an indoor navigation system first trained on a very coarse set of labels (e.g., movable objects, permanent structures, and furniture) in order to, e.g., avoid obstacles. After a while, the dataset used for the initial training could be refined with a more fine-grained set of semantic classes (e.g., the movable objects class could be split into books, monitor, etc.) and the task of the robotic system becomes interacting with these new types of objects. One solution could be to retrain the underlying neural network from scratch with the new set of classes; however, other solutions may be more reasonable. For instance, the initial prediction could be helpful for the learning process of the more refined set of classes in the form of an incremental learning approach where new tasks are accomplished at subsequent steps.
Furthermore, solving multiple tasks at the same time could be beneficial in terms of both accuracy and the possibility to choose the appropriate set of labels for the particular task at hand (e.g., object avoidance, object interaction, etc.).

Related Work
Semantic segmentation of a scene is a widely explored research problem that remains challenging despite the huge number of proposed approaches. It is one of the most challenging high-level tasks towards the direction of complete scene understanding and it is typically solved by means of deep learning approaches (see [7] for a recent review of the field). Most current state-of-the-art approaches are based on encoder-decoder schemes [8], on the Fully Convolutional Network (FCN) model [9,10], and on residual networks [11]. Some recent well-known and highly performing methods are DilatedNet [12], PSPNet [13], and DeepLab [14]. In particular, an enhanced version of the latter architecture, i.e., the Deeplab V3+ [5], is the architecture employed as starting point for this work.
Recently, due to the diffusion of consumer depth cameras, many datasets containing color and depth data have been created. In this work, we also exploit depth data and we adapt approaches for color images to this scenario with some modifications at the earlier layers. Although this work focuses more on the incremental refinement than on achieving high performance on the stand-alone segmentation task, a brief overview of recent research papers is presented here. In [15,16], a scheme involving CNNs at multiple scales has been adopted. Two different CNNs for color and depth and a feature transformation network are exploited in [17]. In [18], a region splitting and merging algorithm for RGB-D data has been proposed. In [19], an MRF superpixel segmentation is combined with a tree-structured segmentation for scene labeling. Multiscale approaches have also been exploited (e.g., [20]). Hierarchical segmentation based on the output of contour extraction has been used in [21], which also deals with object detection from the segmented data. Another combined approach for segmentation and object recognition has been presented in [6], which exploits an initial over-segmentation followed by a hierarchical scheme.
The problem of knowledge transfer in machine learning was first introduced by Bucilua et al. [22] in 2006. They focused on the idea of compressing a whole ensemble of models (a collection of models whose final predictions are averaged) into a single and simpler model that is easier and faster to train. This concept was further developed in [23], where the authors tried to solve the problem of adding new classes without losing the model's performance on the older set of classes by introducing the idea of knowledge distillation.
Knowledge distillation and incremental learning techniques have not yet been considerably exploited in the task of semantic image segmentation; indeed, previous studies mainly focus on classification and recognition tasks. One of the first works dealing with incremental learning in the semantic segmentation task is [1,24], where the authors applied knowledge distillation to preserve old classes while incrementally enlarging the capabilities of the learning architecture to properly segment and label new sets of classes. However, the task presented here is considerably different in that all the training images are available from the beginning and the sets of labels are progressively split into more fine-grained hierarchical categories and then revealed to the learner.
The last idea we want to investigate is the multiple learning of more representations at the same time, which is somehow related to multitask learning [2][3][4]. Multitask learning has been widely applied to semantic segmentation. For instance, in [16], a single multiscale CNN is employed to solve the three tasks of depth prediction, surface normals estimation, and semantic labeling. In [25], the semantic segmentation task is solved using three networks with shared features: differentiating instances, estimating masks, and categorizing objects. In [26], multitask learning is employed to align the features computed from synthetic data while performing the predictions of the depth, the edges, and the surface normals with the ones computed from real-world images. In [27], the semantic segmentation map is predicted together with instance segmentation and depth prediction; additionally, an approach to select the weights for each loss is proposed.

Proposed Methods
In this section, we show in detail the approaches proposed in this work. Although the proposed procedures are agnostic to the underlying architecture, for the evaluation, we chose the Deeplab V3+ [5,14] network, which has state-of-the-art performance on segmentation tasks. The network consists of an Xception feature extractor, whose weights were pre-trained [28] on the Pascal VOC 2012 dataset [29], and a decoder made of Atrous Spatial Pyramid Pooling (ASPP) layers. We evaluated our results on the NYUDv2 dataset after a pre-processing stage detailed in Section 4.
As with most deep learning approaches, the model takes as input a multi-channel tensor (in our case, nine channels corresponding to the color image, the depth information and the surface normals) and outputs the predicted softmax tensor with a number of channels equal to the number of predicted classes. This operation returns a probability distribution containing for each pixel the probability of belonging to each specific class. The class corresponding to the highest probability value is chosen for each pixel by an argmax operation. As a final result, we end up with the predicted segmentation map where each pixel value is the index of the class it belongs to.
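The softmax and argmax steps described above can be illustrated with a minimal NumPy sketch. The image size, the number of classes, and the function names are illustrative assumptions; the random scores merely stand in for the network output.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def predict_segmentation(logits):
    """Turn per-pixel class scores (H, W, C) into probabilities and a label map (H, W)."""
    probs = softmax(logits, axis=-1)       # per-pixel probability distribution
    label_map = probs.argmax(axis=-1)      # index of the most likely class
    return probs, label_map

# Hypothetical 4x6 image with 13 classes; random scores replace the real network.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6, 13))
probs, label_map = predict_segmentation(logits)
```

Each pixel of `label_map` holds a class index, while `probs` retains the full distribution over classes for that pixel.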
To exploit the multiple types of information, we performed an early fusion of the different representations, i.e., color, depth, and surface normals, as depicted in Figure 1, and then fed them to the network. More in detail, each input tensor has nine channels. The first three channels correspond to the RGB color representation in the range [−1, 1], i.e., we divided the color values by 127.5 and then subtracted 1. The following three channels correspond to the geometry components, where each channel represents the position of each pixel with respect to the three axes (X, Y, Z) of the 3D space. We normalized these values by subtracting the mean and dividing by the standard deviation along each axis. The last three channels represent the surface normal vectors. We used the standard representation with the three components of the unit vector perpendicular to the surface at each location (i.e., the components assume values in the range [−1, 1] and the vector norm is equal to 1).

For the training procedures, we employed the Jaccard loss (also known as the Intersection over Union (IoU) loss). We chose this loss since it has proven to be useful when training on a dataset with unbalanced numbers of pixels in the different classes within an image, because it gives equal weight to all classes. Additionally, it has shown better perceptual quality than the usual cross-entropy loss with our setup, in which there are some small objects and many under-represented labels in the dataset. The Jaccard loss is defined as:

$$\mathcal{L}_{Jaccard} = 1 - \frac{|\hat{y} \cap y|}{|\hat{y} \cup y|}$$

where | · | represents the cardinality of the considered set, ŷ is the predicted segmentation map, and y is the ground-truth map.
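As a rough illustration of the early fusion and of the Jaccard loss described above, the following NumPy sketch builds a nine-channel input and evaluates a differentiable ("soft") relaxation of the loss. Several choices are assumptions rather than the authors' implementation: the geometry statistics are computed per image, and the soft intersection and union are computed per class and then averaged so that every class weighs equally.

```python
import numpy as np

def fuse_inputs(rgb_uint8, xyz, normals):
    """Early fusion: stack color, geometry and surface normals into (H, W, 9)."""
    rgb = rgb_uint8.astype(np.float32) / 127.5 - 1.0        # [0, 255] -> [-1, 1]
    # Assumption: mean/std taken over this image, separately for X, Y and Z.
    mean = xyz.mean(axis=(0, 1), keepdims=True)
    std = xyz.std(axis=(0, 1), keepdims=True) + 1e-8
    geo = (xyz - mean) / std
    # Normals are expected to be unit vectors already (components in [-1, 1]).
    return np.concatenate([rgb, geo, normals], axis=-1).astype(np.float32)

def jaccard_loss(probs, onehot, eps=1e-7):
    """Soft Jaccard loss: 1 - mean over classes of |y_hat ∩ y| / |y_hat ∪ y|.

    probs  : (H, W, C) predicted class probabilities
    onehot : (H, W, C) one-hot ground truth
    """
    inter = (probs * onehot).sum(axis=(0, 1))                       # soft intersection
    union = probs.sum(axis=(0, 1)) + onehot.sum(axis=(0, 1)) - inter  # soft union
    return 1.0 - np.mean((inter + eps) / (union + eps))
```

With one-hot predictions, the soft intersection and union reduce to the pixel counts in the set-based definition, so a perfect prediction yields a loss of 0 and a prediction with no overlap yields a loss close to 1.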

Hierarchical Learning
In this section, we present and discuss the various methods we designed for knowledge transfer in semantic segmentation.
In the first approach, we used a different Deeplab V3+ model for each step (i.e., on the considered dataset, we have a first model M1 for the 5-class setting, a second model M2 for the 15-class setting, and a third model M3 for the 41-class setting). As in every incremental learning approach, we start by training the M1 Deeplab V3+ model on the coarser set of classes (e.g., five classes in our scenario). After that, we freeze the first model M1 and we employ the output of its softmax operation as an additional input component when we train the model M2 on the set of more fine-grained classes (e.g., 15 classes). We repeat the same approach when moving from M2 to M3 (i.e., from 15 to 41 classes). This methodology was partially derived from the idea presented in [18], where the softmax information is used for a binary classification task. Furthermore, notice that, when training for the finer tasks, the networks corresponding to the coarser ones are frozen, i.e., we do not train in a single step a large network containing the two (or three) networks for the two (or three) tasks, but we perform a set of independent trainings, each working on a single stage of the network.
More in detail, the number of predicted classes was 4, 13, and 40, respectively, because the unknown and the unlabeled classes were discarded, as done by all competing approaches (see Section 4 for further details). Note that the number of trainable parameters remains constant across the stages because, in each incremental step, the previous network is completely frozen and not trained anymore. The proposed framework is shown in Figure 2, and it is evident that the previous stage of training acts as a conditioning element for the following one. Indeed, the softmax tensor output by the first training stage (i.e., by M1) serves as an additional input (concatenated with the RGB images) for the second stage (model M2), and the same idea is exploited when moving from M2 to M3. This way, the network is constrained to learn the mapping from the coarser to the finer-grained sets of classes.

In the second approach, we fed the argmax of the predicted semantic map, instead of the softmax, as an additional input to the incremental stages. This approach is shown in Figure 3. The main difference from the previous approach is that we only feed the index of the maximum of the predicted map and we drop the information about the probabilities of the various classes, which was previously represented by the softmax vector for each pixel. In this way, we lose the information about the uncertainty of the prediction but, on the other hand, the representation is much more compact, having only a single value representing the predicted class for each pixel. Notice that the first approach (softmax) is more complex but leads to slightly better results (see the experimental evaluation in Section 5). On the other hand, the second approach (argmax) is faster and simpler, even if it has slightly worse performance.

In the third approach, we fed the edges of the predicted semantic map as an additional input to the incremental stages. This approach is shown in Figure 4. Differently from before, the additional information channel no longer contains the semantic labels of the classes but is instead represented by the boundary information. In this way, the second stage of training is more focused on the contours of the shapes, which are generally difficult to discriminate in semantic segmentation tasks.

We argue that combining multiple cues could lead to further improvements; however, this possibility is limited in practice by the memory constraints of the employed GPUs.
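The three conditioning cues (softmax, argmax, and edges) can be sketched as follows, given the per-pixel probabilities produced by a frozen coarse model. This is a simplified NumPy illustration: the function names are hypothetical, and the edge extractor (a pixel is an edge when its label differs from its right or bottom neighbour) is an assumption, since the exact boundary operator is not specified here.

```python
import numpy as np

def label_edges(labels):
    """Binary edge map: 1 where a label differs from its right or bottom neighbour."""
    edges = np.zeros(labels.shape, dtype=np.float32)
    edges[:, :-1] = np.maximum(edges[:, :-1], labels[:, :-1] != labels[:, 1:])
    edges[:-1, :] = np.maximum(edges[:-1, :], labels[:-1, :] != labels[1:, :])
    return edges

def conditioning_input(x, coarse_probs, mode):
    """Concatenate a cue from the frozen coarse model to the input x (H, W, C_in)."""
    labels = coarse_probs.argmax(axis=-1)
    if mode == "softmax":
        cue = coarse_probs                              # (H, W, C_coarse), keeps uncertainty
    elif mode == "argmax":
        cue = labels[..., None]                         # (H, W, 1), class index only
    elif mode == "edges":
        cue = label_edges(labels)[..., None]            # (H, W, 1), boundaries only
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.concatenate([x, cue.astype(np.float32)], axis=-1)
```

For a nine-channel input and a four-class coarse model, the softmax cue yields 13 input channels, while the argmax and edges cues yield 10, which reflects the compactness trade-off discussed above.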

Joint Learning of Multiple Representations
Finally, we started to investigate the prediction of different labelings at the same time, to assess whether learning more detailed information about the content of the coarse classes could improve performance on the coarser set of labels and, vice versa, whether the coarse labeling could help the fine one. We then designed a different decoder to handle the multiple representations and we trained the architecture end-to-end with three different losses (one for each set of labels). In this case, the complete loss function is defined as:

$$\mathcal{L} = \sum_{i=1}^{3} \lambda_i \, \mathcal{L}_{Jaccard,X_i}$$

where X_i is the i-th set of labels, i.e., X_1, X_2, and X_3 contain, respectively, 5, 15, and 41 classes, while the hyper-parameters λ_i balance the three losses. They were empirically set to 1 so that all the terms contribute equally during the back-propagation phase. The loss associated with the set X_i is written as L_{Jaccard,X_i}. The approach is illustrated in Figure 5: we used a single standard DeepLab V3+ encoder, while the decoder was modified to deal with the multiple tasks together. From this figure, we can appreciate that the first part of the decoder is shared across all the tasks while the last 1 × 1 convolution layer is unique for each segmentation task (i.e., 4-, 13-, and 40-class segmentation in our case, after excluding the unknown and the unlabeled classes as detailed in Section 4). The final 1 × 1 layers are followed by a bilinear upsampling procedure to restore the original input dimensions, and a softmax classification layer is then applied to each output to get the final predictions.
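The joint scheme can be sketched in NumPy as a shared feature map followed by one 1 × 1 convolution per label set (a 1 × 1 convolution is a per-pixel matrix product) and a weighted sum of the per-task losses. The feature size, the random weights, and the soft per-class Jaccard relaxation are illustrative assumptions; the bilinear upsampling step is omitted for brevity.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last (class) axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def jaccard_loss(probs, onehot, eps=1e-7):
    """Soft Jaccard loss averaged over classes (an assumed relaxation)."""
    inter = (probs * onehot).sum(axis=(0, 1))
    union = probs.sum(axis=(0, 1)) + onehot.sum(axis=(0, 1)) - inter
    return 1.0 - np.mean((inter + eps) / (union + eps))

def multi_head_predict(features, head_weights):
    """Apply one 1x1 convolution per label set to the shared features (H, W, F)."""
    return [softmax(features @ w) for w in head_weights]   # each w has shape (F, C_i)

def joint_loss(preds, targets, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the per-task Jaccard losses; all lambdas default to 1."""
    return sum(lam * jaccard_loss(p, t) for lam, p, t in zip(lambdas, preds, targets))

# Hypothetical shared decoder output and three task-specific heads (4, 13, 40 classes).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
heads = [rng.normal(size=(8, c)) for c in (4, 13, 40)]
preds = multi_head_predict(features, heads)
targets = [np.eye(c)[rng.integers(0, c, size=(4, 4))] for c in (4, 13, 40)]
total = joint_loss(preds, targets)
```

Because the heads share all preceding layers, the gradient of `total` with respect to the shared features would combine the signals from all three label sets in a real training loop.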

Training on the NYUDv2 Dataset
The NYUDv2 dataset [6] was used to train the proposed architectures and to evaluate the performance of the proposed approach. This dataset contains 1449 depth maps and color images of indoor scenes acquired with a first-generation Kinect sensor, divided into a training set with 795 scenes and a test set with the remaining 654 scenes. The original resolution of the images is 640 × 480; however, for the training procedures, we employed a lower resolution of 560 × 425 due to memory constraints. The evaluation of the results, instead, was carried out on the original-resolution images for a fair comparison with competing approaches. For the evaluation, we used the ground truth labels from [30], and we considered the three clusters of 5, 15, and 41 labels, respectively, as mapped in [6,31].
In particular, the three considered sets of labels are hierarchically represented in Figure 6, where we can appreciate how the derived classes emerge from the parent ones. Two classes, i.e., unlabeled and floor, are peculiar because they are never split when moving to finer semantic representations. Similarly, various classes in the set of 15 labels are not split when moving to the finer set of 41. From the diagram, we can notice clearly unbalanced splits. For instance, some classes of the set of 15 are underrepresented in the dataset, as can be appreciated from the very low fraction of the parent-class pixels they account for: e.g., ceiling and window account for only 2.4% and 6.1%, respectively, of the instances of permanent structure, and monitor and books account for only 3.7% and 4.3% of the movable props parent class. Additionally, the splitting is not uniformly distributed among the parent classes: 13 of the 15 classes in the set of 15 derive from just 3 of the 5 classes in the set of 5. If we move to the analysis of the set of 41, the imbalance becomes even more severe. A few classes derive from ≤ 2% of the instances of their parent class (e.g., bath tub, toilet, and night stand), and many others (10 classes) derive from 2-5%. Moreover, it should be noticed that, in this case, 29 classes out of 41 derive from just three parent classes, thus confirming the extreme inhomogeneity of this splitting. As done by all the competing approaches (e.g., [16,18,31,32]), we removed the unlabeled and unknown classes, when present, both from the prediction and from the evaluation of the results. Indeed, they are fictitious classes artificially created during the labeling of the images. This choice allowed a direct comparison of the results with competing approaches, although these are not related to incremental or multi-task learning.
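The parent-child relation between label sets, and the fraction of a parent class covered by each child, can be illustrated with a lookup table. The sketch below only includes classes whose parent is stated in the text; the integer ids and the helper names are hypothetical and do not reproduce the full mapping of [6,31].

```python
import numpy as np

COARSE = ["unlabeled", "floor", "permanent structure", "furniture", "movable props"]

# Partial fine -> coarse mapping, restricted to classes mentioned in the text.
PARENT_OF = {
    "unlabeled": "unlabeled",
    "floor": "floor",
    "ceiling": "permanent structure",
    "wall": "permanent structure",
    "window": "permanent structure",
    "bed": "furniture",
    "furniture": "furniture",
    "books": "movable props",
    "monitor": "movable props",
    "object": "movable props",
}
FINE = list(PARENT_OF)                                        # hypothetical id order
LUT = np.array([COARSE.index(PARENT_OF[name]) for name in FINE])

def to_coarse(fine_labels):
    """Collapse a fine label map into the coarse one via table lookup."""
    return LUT[fine_labels]

def child_fraction(fine_labels, child):
    """Fraction of the parent's pixels that belong to the given child class."""
    c = FINE.index(child)
    parent_mask = to_coarse(fine_labels) == LUT[c]
    return (fine_labels == c).sum() / parent_mask.sum()

# Toy 2x4 label map: one ceiling pixel and three wall pixels in the first row.
fine = np.array([[2, 3, 3, 3],
                 [5, 6, 7, 9]])
coarse = to_coarse(fine)
```

On this toy map, `child_fraction(fine, "ceiling")` evaluates to 0.25, i.e., ceiling covers one quarter of the permanent structure pixels; the percentages quoted above are the analogous quantities computed on the dataset.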
Moreover, in Figure 7, we can appreciate the various levels of semantic understanding that have been considered for the evaluation. For instance, in the first row, we can visualize how the generic furniture class in the set of five classes (in yellow) is split into bed and furniture in the set of 15 classes (light blue and blue, respectively) and that bed is further refined into bed and pillows in the set of 41 classes (in orange and light green, respectively). Again, in the second row, we can appreciate how the generic class movable props of the set of five classes (in purple) is refined into books and object in the set of 15 classes (in dark green and light purple, respectively), and then further refined in the last set of classes. Finally, in the third row, we can visualize how the permanent structure class in the set of five classes (in orange) is split into the classes ceiling and wall in the set of 15 classes (in yellow and pink, respectively).
The various approaches were trained on the NYUDv2 training set using the three different sets of labels. We employed Stochastic Gradient Descent (SGD) and ran the procedure for 100 epochs. The initial learning rate lr_0 was set to 10^−2, the weight decay w_d to 0.9, and the batch size to 2. The learning rate scheduler decreased the learning rate lr every s = 2 epochs as a function of the index ep of the current epoch. We used TensorFlow [33] to develop and train our framework. For each stage, the number of trainable parameters and FLOPs was roughly the same as in the original Deeplab V3+ architecture, i.e., 41M and 82B, respectively (the added components require a very small number of parameters with respect to the Deeplab model). The training of the neural network took about 22 h on an NVIDIA Tesla K40 GPU with an Intel(R) Core(TM) i7 CPU 970 @ 3.2 GHz. The implementation of the proposed model is available at https://github.com/LTTM/IL-Coarse2Fine.
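A generic step-decay schedule that lowers the learning rate every s epochs can be sketched as follows. This is only one plausible form: using 0.9 as the multiplicative decay factor, and the exponential step shape itself, are assumptions and may differ from the schedule actually used by the authors.

```python
def step_decay_lr(epoch, lr0=1e-2, decay=0.9, step=2):
    """Learning rate at a given epoch for a step-wise exponential decay.

    lr0   : initial learning rate (10^-2 in the text)
    decay : multiplicative factor applied at each step (assumed to be 0.9)
    step  : number of epochs between two decreases (s = 2 in the text)
    """
    return lr0 * decay ** (epoch // step)

# Learning rate for the first few epochs under these assumptions.
schedule = [step_decay_lr(ep) for ep in range(6)]
```

Under these assumptions, the rate stays constant within each pair of epochs and is multiplied by the decay factor at every even epoch index.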

Experimental Results
In this section, we discuss the performance of the various proposed approaches in the two different settings of incremental and multi-task learning.
First, we start by comparing our modified Deeplab V3+ architecture with early fusion of the three information representations, i.e., color, depth, and surface normals, with some recent works. To evaluate the results, we employed the most widely used metrics for semantic segmentation problems: the Pixel Accuracy (PA), the mean Class Accuracy (mCA), and the mean Intersection over Union (mIoU) [34].
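The three metrics can all be derived from a confusion matrix, as in the following NumPy sketch. It uses the common definitions of PA, mCA, and mIoU; averaging only over classes that appear in the ground truth, and assuming that ignored classes (e.g., unlabeled) have already been removed, are choices made here for illustration.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Return (pixel accuracy, mean class accuracy, mean IoU) from a confusion matrix."""
    tp = np.diag(cm).astype(float)         # correctly labeled pixels per class
    gt_count = cm.sum(axis=1)              # ground-truth pixels per class
    pred_count = cm.sum(axis=0)            # predicted pixels per class
    present = gt_count > 0                 # skip classes absent from the ground truth
    pa = tp.sum() / cm.sum()
    mca = np.mean(tp[present] / gt_count[present])
    union = gt_count + pred_count - tp
    miou = np.mean(tp[present] / union[present])
    return pa, mca, miou

# Toy example with two classes and four pixels.
gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
pa, mca, miou = segmentation_metrics(confusion_matrix(gt, pred, 2))
```

In this toy case, three of four pixels are correct (PA = 0.75), the per-class accuracies are 0.5 and 1.0 (mCA = 0.75), and the per-class IoUs are 1/2 and 2/3 (mIoU ≈ 0.583). PA is dominated by large classes, whereas mCA and mIoU weigh all classes equally.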
The modified network is able to obtain state-of-the-art results on all three sets of labels. Table 1 confirms that our baseline network outperforms competing approaches in terms of both PA and mCA on the set of four classes; additionally, we also show the obtained mIoU. Similar results were achieved by our baseline model for the set of 13 classes, as shown in Table 2, while on the set of 40 labels some very recent approaches have better performance (see Table 3). However, notice that the aim of this work is to propose an efficient hierarchical learning strategy, not to improve the performance on the segmentation task by itself.
Then, we evaluated our hierarchical learning approaches. First, we started from the coarser set of four classes and moved to the prediction of 13 classes: the results are shown in the last three lines of Table 2 for the three different approaches. In this case, we can appreciate that adding either the softmax information from the four-class model or the edges information is a useful cue to reach higher accuracy on the new set of classes compared with the baseline counterpart. With either the softmax or the edges information, there are improvements in all three considered metrics with respect to the baseline Deeplab V3+. In particular, the softmax information leads to the best class accuracy (almost 70%), while the use of edge information is the best strategy with respect to the pixel accuracy (76.3%) and to the mIoU (53.9%). Notice that the mIoU gap with respect to the direct training on the 13 classes is about 2.5% (this metric and the mCA are more interesting for our task since the pixel accuracy is strongly dependent on large structures, such as the floor, that are not split in the hierarchical labeling). The argmax information, instead, brings a limited contribution to the final accuracy values.

Table 1. Semantic segmentation performance on the NYUDv2 dataset with four classes of the proposed method and of some competing approaches (the table shows percentage values). We underlined the best result among all the methods for each metric, while the best result among the proposed techniques is reported in bold.

Method | PA | mCA | mIoU
Silberman et al. [6] | 59.6 | 58.6 | -
Ren et al. [19] | 73.0 | 58.0 | -
Müller et al. [35] | 71.9 | 72.3 | -
Gupta et al. [36] | 78.0 | 64.0 | -
Cadena et al. [37] | 66.9 | 65.2 | -
Stuckler et al. [38] | 70.9 | 65.0 | -
Couprie et al. [31] | 64.5 | 63.5 | -
Eigen et al. [16] | 83.

One may ask what would happen if we trained both the first and the second stage with the same set of classes (i.e., by just using a deeper network without really exploiting the hierarchical structure of the classes). We expect this scenario to achieve almost the same results as our baseline approach, or slightly higher, since we are retraining the same architecture with an additional input, which is the output of the previously trained network. At the same time, we expect the incremental framework to be the dominant factor for the performance increase. Indeed, the results for this stacking are perfectly in line with our intuition, as reported in Table 2 with the name "Stacking".

Table 2. Semantic segmentation performance on the NYUDv2 dataset with 13 classes of the proposed methods and of some competing approaches (the table shows percentage values). We underline the best result among all the methods for each metric, while the best result among the proposed techniques is reported in bold.

Furthermore, we performed an additional incremental step to predict the set of 40 classes starting from the prediction of the set of 13 labels. The results are reported in Table 3. In this case, our method was outperformed by some methods in the literature, due to some inner limitations of the employed Deeplab V3+ architecture. However, the most interesting aspect for this work is the comparison with our baseline method, i.e., Deeplab V3+ directly trained on the 40 classes, in order to appreciate the gain of the hierarchical approaches.

We can appreciate how the three additional cues employed produce some improvements in the various metrics, even if the gain is more limited. The result is still notable if we remember that the splitting is highly unbalanced and inhomogeneous, as we have seen in Section 4.

Table 3. Semantic segmentation performance on the NYUDv2 dataset with 40 classes of the proposed methods and of some competing approaches (the table shows percentage values). We underline the best result among all the methods for each metric, while the best result among the proposed techniques is reported in bold.

Method | PA | mCA | mIoU
Silberman et al. [6] | 54.6 | 19.0 | -
Ren et al. [19] | 49.

Finally, in Table 4, we evaluate our joint learning approach on the three sets of classes simultaneously. We can appreciate that the joint model not only predicts the three sets of labels at the same time, without the need for multiple training stages, but also improves the accuracy with respect to the baseline in all the scenarios and for all the metrics. The improvement, although consistent across all experiments and metrics, is sometimes modest and smaller than that of some of the previously proposed methods. The highest gains are achieved in the PA and mIoU for the set of four classes and in the PA for the set of 40 classes. It should be noticed that the chosen architecture is already highly performing, especially on the sets of 4 and 13 classes.

Figure 8 shows some qualitative results for the set of five classes. Here, we can compare the performance of our baseline Deeplab V3+ with the multi-tasking approach (the other methods do not apply to this setting since there is no coarser representation). As already noticed, both approaches have very high accuracy on this task. The results for the image in the first column are very similar between the two approaches; however, we can verify how the multi-tasking learning outperforms the baseline in the remaining four images. In particular, consider the purple object on the top left of the image in the second column, the orange top of the furniture on the left, or the purple objects in the center-right of the image. In Column 3, we can clearly see an improvement in the definition of the shapes of the sofa and of the pillows. Similarly, the wall and the furniture are better recognized in the last two images.

[Figure 8 legend: unlabeled, floor, perm structure, furniture, movable props]

Finally, we report in Figure 9 the qualitative results for the set of 15 classes for all the proposed approaches. From the figure, we can appreciate that the incremental approach with the softmax or the edges generally leads to a much cleaner prediction with few artifacts. This can be seen particularly well in Column 1, where the chair in the center of the image is fully recovered by the proposed incremental methods, and in Column 3, where the wall is cleaned from prediction errors (in this scene, the sofa has also been properly segmented, but the wrong label of bed has been associated with it). In the same scenes, the baseline, the incremental approach with the argmax, and the multi-tasking approach suffer from some artifacts. In Column 4, we can notice that the edges information proved to be more informative than the softmax information for properly recognizing the leg of the pool table.
In general, we argue that the incremental approaches with edges or softmax information are much more reliable than the conditioning based on the argmax: in the softmax case, a larger amount of information is fed as additional input giving the network the possibility to discriminate between certain or uncertain predictions while in the edges case the network is forced to focus more on the edges of the objects, which typically represent one of the most challenging characteristics to be learned.
As for the computational requirements, the inference time of the modified Deeplab V3+ network is 23 ms on the workstation used for the training (with an Intel 970 CPU and an NVIDIA K40 GPU), which is roughly the same as that of the standard DeepLab V3+ architecture. Notice that incremental schemes require multiple inferences, e.g., the 13-class experiment requires executing the model twice in cascade (once for the four-class network and once for the 13-class network, which also takes as input the outcome of the four-class model). If real-time performance were needed, the best option would be the joint learning scheme proposed in Section 3.2, which is able to perform all three tasks with a single pass through the encoder module; the encoder is the most computationally demanding stage, the decoder being very lightweight. As a comparison, current works (e.g., [37,38]) report higher computation times of 1.02 s and 78 ms, even if a direct comparison is not possible due to the completely different hardware setups.

Conclusions
In this paper, we introduce and address the novel problem of hierarchical incremental learning where a first deep neural network is trained on a small set of macro-classes and is then adapted and refined to recognize a larger set of classes with a finer semantic content. We propose three different hierarchical strategies exploiting the softmax and argmax of the coarse network output and the edges information from the segmentation maps of the coarse network. Furthermore, a scheme for the joint training on the three tasks is also proposed. Experimental results show that all the proposed schemes allow improving the performance with respect to the direct training on the larger set of classes.
Further research will be devoted to improving the multitask approach, possibly also with the addition of different tasks (e.g., the prediction of the surface normals or of the depth maps without using them as inputs). Given that the proposed methodologies are agnostic to the underlying neural network architecture, we will consider employing a more lightweight network that allows combining multiple cues together.