1. Introduction
The semantic understanding of a scene is a long-standing problem in the computer vision field that can be approached at different levels of interpretation. For instance, in image classification, a single label describing the main object in each image is returned as output. In object detection, instances of particular classes are identified by means of a bounding box surrounding the object, and a label is assigned to each instance. Semantic segmentation, instead, is a dense labeling task in which a label has to be associated with every single pixel of the image. Similarly, we could interpret the scene at different levels of precision: In some scenarios, for example, it may be enough to identify just a few classes, while in others a more fine-grained prediction could be required. Moreover, in other settings, a coarse set of classes could be predicted first and then hierarchically grown into more refined categories to better understand the semantic context. To visualize this scenario, imagine an indoor navigation system first trained on a very coarse set of labels (e.g., movable objects, permanent structures, and furniture) in order to, e.g., avoid obstacles. After a while, the dataset used for the initial training could be refined with a more fine-grained set of semantic classes (e.g., the movable objects class could be split into books, monitor, etc.) and the task of the robotic system becomes interacting with these new types of objects. One solution could be to retrain the underlying neural network from scratch with the new set of classes; however, other solutions may be more reasonable. For instance, the initial prediction could support the learning of the more refined set of classes in the form of an incremental learning approach, where new tasks are accomplished at subsequent steps. Furthermore, solving multiple tasks at the same time could be beneficial in terms of both accuracy and the possibility of choosing the appropriate set of labels for the particular task at hand (e.g., object avoidance, object interaction, etc.).
Starting from these considerations, the main goal of this work is to transfer previously gained knowledge, acquired on a simple semantic segmentation task with coarse classes (e.g., structures, furniture, objects, etc.), to a new model where more fine-grained and detailed semantic classes (e.g., walls, beds, cups, etc.) are introduced.
Notice that this task is similar to incremental learning; however, there are a few key differences. With respect to incremental learning for semantic segmentation [
1], here we do not add the ability to segment and label new classes; instead, we refine the initial prediction on a coarse set of classes with a more fine-grained set of classes originating from the previous one. In this sense, indeed, we expand the ability of the deep neural network to accomplish the new fine-grained task.
The last focus of this work is the development of a new approach based on the simultaneous output of multi-level semantic maps. By exploiting domain sharing at the feature extraction level, the model will be fine-tuned to learn different levels of semantic labeling at the same time. This is somehow related to multitask learning [
2,
3,
4], where multiple tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models when compared to training the models separately. Training on multiple tasks can even bring additional information and, by sharing the representations between related tasks, we allow the model to generalize better on one or more tasks. Additionally, there is a reduction in the computational time with respect to training one independent architecture for each task. Our approach, however, is not strictly speaking multitask learning because we do not solve multiple different tasks at the same time (e.g., semantic maps along with depth information, surface normals, etc.); rather, we learn multiple sets of classes at the same time, which allows having different levels of representation of the semantic map simultaneously and greatly reduces the time required for training (a single training replaces multiple training steps). Furthermore, we share the same data across the tasks, which actually represent different hierarchical clusterings of the labels.
We based our approach on an end-to-end deep learning pipeline for semantic segmentation on color and depth data (i.e., color representation with an associated depth information) based on the DeepLab V3+ model [
5]. To train the network and to evaluate the results, we employed the NYUDv2 dataset [
6], which consists of a set of indoor scenes with three sets of labels (5, 15, and 41 classes, respectively) at different levels of semantic precision, and we compared our results with other recent methodologies, although they are not related to incremental or multitask approaches.
The remainder of this paper is organized as follows.
Section 2 presents an overview of related work. In
Section 3, the proposed methodologies are introduced. The employed dataset and training procedures are described in
Section 4, while the experimental results are discussed in
Section 5. Finally,
Section 6 concludes our work and outlines some future developments.
2. Related Work
Semantic segmentation of a scene is a widely explored research problem that remains challenging despite the huge number of proposed approaches. It is one of the most challenging high-level tasks towards complete scene understanding and it is typically solved by means of deep learning approaches (see [
7] for a recent review of the field). Most current state-of-the-art approaches are based on encoder–decoder schemes [
8], on the Fully Convolutional Network (FCN) model [
9,
10], and on residual networks [
11]. Some recent well-known and highly performing methods are DilatedNet [
12], PSPNet [
13], and DeepLab [
14]. In particular, an enhanced version of the latter architecture, i.e., the Deeplab V3+ [
5], is the architecture employed as the starting point for this work.
Recently, due to the diffusion of consumer depth cameras, many datasets containing color and depth data have been created. In this work, we also exploit depth data, adapting approaches designed for color images to this scenario with some modifications in the earlier layers. Although this work focuses more on the incremental refinement than on achieving high performance on the stand-alone segmentation task, a brief overview of recent research papers is presented here. In [
15,
16], a scheme involving CNN at multiple scales has been adopted. Two different CNNs for color and depth and a feature transformation network are exploited in [
17]. In [
18], a region splitting and merging algorithm for RGB-D data has been proposed. In [
19], an MRF superpixel segmentation is combined with a tree-structured segmentation for scene labeling. Multiscale approaches have also been exploited (e.g., [
20]). Hierarchical segmentation based on the output of contour extraction has been used in [
21], which also deals with object detection from the segmented data. Another combined approach for segmentation and object recognition has been presented in [
6], which exploits an initial over-segmentation followed by a hierarchical scheme.
The problem of knowledge transfer in machine learning was first introduced by Bucilua et al. [
22] in 2006. They focused on the idea of compressing a whole ensemble of models (a collection of models whose final predictions are averaged) into a single, simpler model that is easier and faster to train. This concept was further developed in [
23], where the authors tried to solve the problem of adding new classes without losing the model’s performance on the older set of classes, introducing the idea of knowledge distillation.
Knowledge distillation and incremental learning techniques have not yet been considerably exploited in the task of semantic image segmentation; indeed, previous studies mainly focus on classification and recognition tasks. One of the first works dealing with incremental learning in the semantic segmentation task is [
1,
24], where the authors applied knowledge distillation to preserve old classes while incrementally enlarging the capabilities of the learning architecture to properly segment and label new sets of classes. However, the task presented here is considerably different in that all the training images are available from the beginning and the sets of labels are progressively split into more fine-grained hierarchical categories and then revealed to the learner.
The last idea we want to investigate is the joint learning of multiple representations at the same time, which is somehow related to multitask learning [
2,
3,
4]. Multitask learning has been widely applied to semantic segmentation. For instance, in [
16], a single multiscale CNN is employed to solve the three tasks of depth prediction, surface normals estimation, and semantic labeling. In [
25], the semantic segmentation task is solved using three networks with shared features: Differentiating instances, estimating masks, and categorizing objects. In [
26], multitask learning is employed to align the features computed from synthetic data while performing the predictions of the depth, the edges, and the surface normals with the ones computed from real-world images. In [
27], the semantic segmentation map is predicted together with instance segmentation and depth prediction; additionally, an approach to select the weights for each loss is proposed.
3. Proposed Methods
In this section, we show in detail the approaches proposed in this work. Although the proposed procedures are agnostic to the underlying architecture, for the evaluation, we chose the Deeplab V3+ [
5,
14] network, which has state-of-the-art performance on segmentation tasks. The network consists of an Xception feature extractor, whose weights were pre-trained [
28] on the Pascal VOC 2012 dataset [
29], and a decoder made of Atrous Spatial Pyramid Pooling (ASPP) layers. We evaluated our results on the NYUDv2 dataset after a pre-processing stage detailed in
Section 4.
As with most deep learning approaches, the model takes as input a multi-channel tensor (in our case, nine channels corresponding to the color image, the depth information, and the surface normals) and outputs a softmax tensor with a number of channels equal to the number of predicted classes. This operation returns, for each pixel, a probability distribution over the classes. The class corresponding to the highest probability is chosen for each pixel by an argmax operation. As a final result, we end up with the predicted segmentation map, where each pixel value is the index of the class it belongs to.
To exploit the multiple types of information, we performed an early fusion of the different representations, i.e., color, depth, and surface normals, as depicted in
Figure 1, and then we fed them to the network. More in detail, each input tensor has nine channels. The first three channels correspond to the RGB color representation normalized to the range [−1, 1], i.e., we divided the color values by 127.5 and then subtracted 1. The following three channels correspond to the geometry components, where each channel represents the position of each pixel with respect to the three axes (X, Y, Z) of the 3D space. We normalized these values by subtracting the mean and dividing by the standard deviation along each axis. The last three channels represent the surface normal vectors. We used the standard representation with the three components of the unit vector perpendicular to the surface at each location (i.e., the components assume values in the range [−1, 1] and the vector norm is equal to 1).
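As an illustration, a minimal sketch of this early-fusion preprocessing is reported below. The function name and array shapes are our own assumptions, and the normalization statistics are computed per image here, which is only one possible reading of the description above.

```python
import numpy as np

def build_input_tensor(rgb, xyz, normals):
    """Early fusion of color, geometry, and surface normals into a
    nine-channel input tensor (illustrative sketch, not the original code).

    rgb:     (H, W, 3) uint8 color image with values in [0, 255]
    xyz:     (H, W, 3) float32 per-pixel 3D coordinates
    normals: (H, W, 3) float32 unit surface normals with components in [-1, 1]
    """
    # Color: map [0, 255] to [-1, 1] by dividing by 127.5 and subtracting 1.
    rgb_n = rgb.astype(np.float32) / 127.5 - 1.0

    # Geometry: standardize each axis (X, Y, Z) independently.
    mean = xyz.reshape(-1, 3).mean(axis=0)
    std = xyz.reshape(-1, 3).std(axis=0) + 1e-8   # avoid division by zero
    xyz_n = (xyz - mean) / std

    # Normals are already unit vectors, so they are used as they are.
    return np.concatenate([rgb_n, xyz_n, normals.astype(np.float32)], axis=-1)
```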
For the training procedures, we employed the Jaccard loss (also known as Intersection over Union (IoU) loss). We chose this loss since it has proven to be useful when training on a dataset with an unbalanced number of pixels in the different classes within an image, because it gives equal weight to all classes. Additionally, it has shown better perceptual quality than the usual cross-entropy loss with our setup, in which there are some small objects and many under-represented labels in the dataset. The Jaccard loss is defined as:

$$ \mathcal{L}_J = 1 - \frac{|\hat{y} \cap y|}{|\hat{y} \cup y|} $$

where $|\cdot|$ represents the cardinality of the considered set, $\hat{y}$ is the predicted segmentation map, and $y$ is the ground-truth map.
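A possible TensorFlow implementation of this loss, assuming one-hot ground-truth maps and a soft (probability-based) intersection and union with per-class averaging, could look as follows; the exact reduction used in the original code is not specified.

```python
import tensorflow as tf

def jaccard_loss(y_true, y_pred, smooth=1e-6):
    """Soft Jaccard (IoU) loss averaged over the classes.

    y_true: one-hot ground truth,  shape (batch, H, W, num_classes)
    y_pred: softmax probabilities, shape (batch, H, W, num_classes)
    smooth: small constant avoiding division by zero for absent classes
    """
    axes = [0, 1, 2]  # sum over the batch and the spatial dimensions
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    union = tf.reduce_sum(y_true + y_pred, axis=axes) - intersection
    iou_per_class = (intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(iou_per_class)  # equal weight to every class
```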
3.1. Hierarchical Learning
In this section, we present and discuss the various methods we designed for knowledge transfer in semantic segmentation.
In the first approach, we used a different Deeplab V3+ model for each step (i.e., on the considered dataset, we have a first model for the 5-class setting, a second one for the 15-class setting, and a third one for the 41-class setting). As in every incremental learning approach, we start by training the first Deeplab V3+ model on the coarser set of classes (e.g., five classes in our scenario). After that, we freeze the first model and we employ the output of its softmax operation as an additional input component when we train the second model on the set of more fine-grained classes (e.g., 15 classes). We repeat the same approach when moving from the second to the third model (i.e., from 15 to 41 classes). This methodology was partially derived from the idea presented in [
18], where the softmax information is used for a binary classification task. Furthermore, notice that, when training for the finer tasks, the networks corresponding to the coarser ones are frozen, i.e., we do not train in a single step a large network containing the two (or three) networks for the two (or three) tasks; instead, we perform a set of independent trainings, each working on a single stage of the pipeline.
More in detail, the number of predicted classes is 4, 13, and 40, respectively, because the unknown and unlabeled classes were discarded, as done by all competing approaches (see
Section 4 for further details). Note that the number of trainable parameters remains constant across the stages because, in the incremental step, the previous network is completely frozen and not trained anymore. The proposed framework is shown in
Figure 2, and it is evident that the previous stage of training acts as a conditioning element for the following one. Indeed, the softmax tensor output by the first training stage serves as an additional input (concatenated with the RGB images) for the second stage, and the same idea is exploited when moving from the second to the third stage. In this way, the network is constrained to learn the mapping from the coarser to the finer-grained sets of classes.
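A minimal sketch of this conditioning mechanism is given below, assuming the coarse model is a trained Keras model returning per-pixel class scores at the input resolution; the function and variable names are illustrative and not taken from the original implementation.

```python
import tensorflow as tf

def conditioned_input(x, coarse_model):
    """Build the input of the fine-grained stage by concatenating the frozen
    coarse model's softmax output to the network input (sketch of the first
    approach described above)."""
    coarse_model.trainable = False                    # the coarse stage stays frozen
    coarse_probs = tf.nn.softmax(coarse_model(x, training=False), axis=-1)
    # Argmax variant (second approach below): a single channel with the class
    # index, e.g. tf.cast(tf.argmax(coarse_probs, axis=-1), tf.float32)[..., None]
    return tf.concat([x, coarse_probs], axis=-1)      # extra conditioning channels
```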
In the second approach, we fed as an additional input to the incremental stages the argmax of the predicted semantic map instead of the softmax. This approach is shown in
Figure 3. The main difference from the previous approach is that we only feed the index of the maximum of the predicted map, dropping the information about the probabilities of the various classes, which was previously represented by the softmax vector at each pixel. In this way, we lose the information about the uncertainty of the prediction but, on the other hand, the representation is much more compact, having only a single value representing the predicted class for each pixel. Notice that the first approach (softmax) is more complex but leads to slightly better results (see the experimental evaluation in
Section 5). On the other hand, the second approach (argmax) is faster and simpler, even if it has slightly worse performance.
In the third approach, we fed as an additional input to the incremental stages the edges of the predicted semantic map. This approach is shown in
Figure 4. Differently from before, the additional information channel no longer contains the semantic labels of the classes but instead carries the boundary information. In this way, the second stage of training is more focused on the contours of the shapes, which are generally difficult to discriminate in semantic segmentation tasks.
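The exact edge extractor is not detailed in the text; one simple possibility, sketched below, is to mark a pixel as a boundary whenever its predicted label differs from that of a neighboring pixel.

```python
import numpy as np

def label_edges(label_map):
    """Binary boundary map of a predicted segmentation map (illustrative).

    label_map: (H, W) array of integer class indices.
    Returns a (H, W) float32 map with 1.0 on class boundaries, 0.0 elsewhere.
    """
    h_diff = label_map[:, :-1] != label_map[:, 1:]    # horizontal label changes
    v_diff = label_map[:-1, :] != label_map[1:, :]    # vertical label changes
    edges = np.zeros(label_map.shape, dtype=np.float32)
    edges[:, :-1] = np.maximum(edges[:, :-1], h_diff.astype(np.float32))
    edges[:-1, :] = np.maximum(edges[:-1, :], v_diff.astype(np.float32))
    return edges
```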
We argue that combining multiple cues could lead to further improvements; however, this possibility is limited in practice by memory constraints of the employed GPUs.
3.2. Joint Learning of Multiple Representations
Finally, we investigated the prediction of different labelings at the same time, asking whether this could improve the performance on the coarser set of labels, since more detailed information about their content is being learned, and, vice versa, whether the coarse labeling can help the fine one. We therefore designed a different decoder to produce the multiple representations and we trained the architecture end-to-end with three different losses (one for each set of labels). In this case, the complete loss function is defined as:

$$ \mathcal{L} = \sum_{i=1}^{3} \lambda_i \, \mathcal{L}_{\mathcal{C}_i} $$

where $\mathcal{C}_i$ is the $i$th set of labels, i.e., $\mathcal{C}_1$, $\mathcal{C}_2$, and $\mathcal{C}_3$ contain, respectively, 5, 15, and 41 classes, while the hyper-parameters $\lambda_i$ balance the three losses. They were empirically set to 1 so that all the terms contribute equally during the back-propagation phase. The loss associated with the set $\mathcal{C}_i$ is then written as $\mathcal{L}_{\mathcal{C}_i}$. The approach is illustrated in
Figure 5: We used a single standard DeepLab V3+ encoder, while the decoder has been modified to deal with the multiple tasks together. From this figure, we can appreciate that the first part of the decoder is shared across all the tasks, while the last $1 \times 1$ convolution layer is unique for each segmentation task (i.e., 4, 13, and 40 classes in our case, after excluding the unknown and unlabeled classes as detailed in
Section 4). The final $1 \times 1$ layers are followed by a bilinear upsampling procedure to restore the original input dimensions, and a softmax classification layer is then applied to each output to get the final predictions.
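A sketch of the multi-output head and of the combined loss is reported below, reusing the jaccard_loss sketched earlier; the layer structure is a simplified illustration and does not reproduce the authors' exact decoder.

```python
import tensorflow as tf

NUM_CLASSES = (4, 13, 40)   # classes actually predicted for the 5/15/41-label sets

def multi_head_outputs(shared_features, output_size):
    """Per-task 1x1 convolution heads on top of the shared decoder features,
    followed by bilinear upsampling and a softmax (illustrative sketch)."""
    outputs = []
    for n in NUM_CLASSES:
        logits = tf.keras.layers.Conv2D(n, kernel_size=1)(shared_features)
        logits = tf.image.resize(logits, output_size, method="bilinear")
        outputs.append(tf.nn.softmax(logits, axis=-1))
    return outputs

def total_loss(y_trues, y_preds, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three Jaccard losses, one per label set."""
    return sum(l * jaccard_loss(t, p) for l, t, p in zip(lambdas, y_trues, y_preds))
```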
4. Training on the NYUDv2 Dataset
The NYUDv2 dataset [
6] was used to train the proposed architectures and to evaluate the performance of the proposed approach. This dataset contains 1449 depth maps and color images of indoor scenes acquired with a first-generation Kinect sensor, divided into a training set with 795 scenes and a test set with the remaining 654 scenes. The original resolution of the images is 640 × 480; however, for the training procedures, we employed a lower resolution due to memory constraints. The evaluation of the results, instead, was carried out on the original-resolution images for a fair comparison with competing approaches. For the evaluation, we used the ground-truth labels from [
30], and we considered the three clusters of 5, 15, and 41 labels, respectively, as mapped in [
6,
31].
In particular, the three considered sets of labels are hierarchically represented in
Figure 6, where we can appreciate how the derived classes emerge from the parent ones. Two classes, i.e., unlabeled and floor, are peculiar because they are never split when moving to finer semantic representations. Similarly, various classes in the set of 15 labels are not split when moving to the finer set of 41. From the diagram, we can notice that there are clearly unbalanced splitting situations. For instance, some classes of the split of 15 are under-represented in the dataset, as can be appreciated from the very low fraction of pixels they inherit from the parent classes: e.g., ceiling and window cover only a small fraction of the instances of permanent structure, and monitor and books are derived from only a small fraction of the movable props parent class. Additionally, the splitting is not uniformly distributed among the parent classes: 13 out of the 15 classes of the split of 15 derive from just 3 out of the 5 classes of the split of 5. If we move to the analysis of the split of 41, the considerations become even more severe. A few classes, e.g., bath tub, toilet, and night stand, derive from a very small fraction of the instances of their parent class, and many others (10 classes) derive from only 2–5% of them. Moreover, it should be noticed that, in this case, 29 out of 41 classes derive from just three parent classes, thus confirming the extreme inhomogeneity of this splitting.
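To make the hierarchical relation between the label sets concrete, the fragment below sketches how finer labels can be mapped back to their parents; only a few of the classes mentioned in the text are listed, the dictionary contents are illustrative, and the complete mapping follows [6,31].

```python
# Excerpt of the hierarchical label mapping (illustrative, not the full mapping).
MID_TO_COARSE = {                      # 15-class label -> 5-class parent
    "bed": "furniture",                "furniture": "furniture",
    "books": "movable props",          "object": "movable props",
    "ceiling": "permanent structure",  "wall": "permanent structure",
    "floor": "floor",                  # never split across the hierarchy
}
FINE_TO_MID = {                        # 41-class label -> 15-class parent (excerpt)
    "bed": "bed",
    "pillows": "bed",
}

def coarsen(fine_label):
    """Map a 41-class label to its 5-class ancestor."""
    return MID_TO_COARSE[FINE_TO_MID[fine_label]]
```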
As done by all the competing approaches (e.g., [
16,
18,
31,
32]), we removed the unlabeled and unknown classes, when present, both from the prediction and from the evaluation of the results. Indeed, they are fictitious classes artificially created during the labeling procedures of the images. This choice allows a direct comparison of the results with competing approaches, even though they are not related to incremental or multitask learning.
Moreover, in
Figure 7, we can appreciate the various levels of semantic understanding that have been considered for the evaluation. For instance, in the first row, we can visualize how the generic
furniture class in the set of five classes (in yellow) is split into
bed and
furniture in the set of 15 classes (light blue and blue, respectively) and that
bed is further refined into
bed and
pillows in the set of 41 classes (in orange and light green, respectively). Again, in the second row, for example, we can appreciate how the generic class
movable props of the set of five classes (in purple) is then refined to
books and
object in the set of 15 classes (in dark green and light purple, respectively), and then further refined in the last set of classes. Finally, in the third row, we can visualize how the
permanent structure class in the set of five classes (in orange) is then split into the classes
ceiling and
wall in the set of 15 classes (in yellow and pink, respectively).
The various approaches were trained on the NYUDv2 training set using the three different sets of labels. We employed Stochastic Gradient Descent (SGD) and ran the procedure for 100 epochs. The initial learning rate
was set to
, the weight decay
to
, and the batch size equal to 2. The learning rate scheduler decreased the learning rate
every
epochs using the following formula:
where
ep denotes the index of the current epoch.
We used TensorFlow [
33] to develop and train our framework. For each stage, the number of trainable parameters and FLOPs is roughly the same as in the original Deeplab V3+ architecture, i.e., 41M and 82B, respectively (the added components require a very small number of parameters with respect to the Deeplab model). The training of the neural network took about 22 h on an NVIDIA Tesla K40 GPU with an Intel(R) Core(TM) i7 970 CPU @ 3.2 GHz. The implementation of the proposed model is available at
https://github.com/LTTM/IL-Coarse2Fine.
5. Experimental Results
In this section, we discuss the performance of the various proposed approaches in the two different settings of incremental and multi-task learning.
First, we start by comparing our modified Deeplab V3+ architecture with early fusion of the three information representations, i.e., color, depth, and surface normals, with some recent works. To evaluate the results, we employed the most widely used metrics for semantic segmentation problems: The Pixel Accuracy (PA), the mean Class Accuracy (mCA), and the mean Intersection over Union (mIoU) [
34].
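For reference, these three metrics can be computed from the confusion matrix as in the sketch below (a standard formulation, assuming rows index the ground-truth classes and columns the predicted ones; not the evaluation code used in the paper).

```python
import numpy as np

def segmentation_metrics(conf):
    """Pixel Accuracy, mean Class Accuracy, and mean IoU from a (C, C)
    confusion matrix whose entry (i, j) counts ground-truth pixels of
    class i that were predicted as class j."""
    tp = np.diag(conf).astype(np.float64)              # correctly classified pixels
    gt_per_class = conf.sum(axis=1).astype(np.float64)
    pred_per_class = conf.sum(axis=0).astype(np.float64)
    pa = tp.sum() / conf.sum()                                    # Pixel Accuracy
    mca = np.mean(tp / np.maximum(gt_per_class, 1.0))             # mean Class Accuracy
    iou = tp / np.maximum(gt_per_class + pred_per_class - tp, 1.0)
    return pa, mca, iou.mean()                                    # mIoU
```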
The modified network is able to obtain state-of-the-art results on all three sets of labels. In
Table 1, we can see that our baseline network outperforms competing approaches in terms of both PA and mCA on the split of four classes; additionally, we also report the obtained mIoU. Similar results were achieved by our baseline model for the set of 13 classes, as shown in
Table 2, while on the set of 40 labels some very recent approaches have better performance (see
Table 3). However, notice that the aim of this work is to propose an efficient hierarchical learning strategy, not to improve the performance on the segmentation task by itself.
Then, we evaluated our hierarchical learning approaches. First, we started from the coarser set of four classes and moved to the prediction of 13 classes: The results are shown in the last three lines of
Table 2 for the three different approaches. In this case, we can appreciate that the addition of the softmax information from the four-class model or of the edges information is a useful cue to reach higher accuracy on the new set of classes when compared with the baseline counterpart: with either cue, there are improvements in all three considered metrics with respect to the baseline Deeplab V3+. In particular, the softmax information leads to the best class accuracy, while the use of edge information is the best strategy with respect to both the pixel accuracy and the mIoU. Notice also the mIoU gap with respect to the direct training on the 13 classes (by the way, this metric and the mCA are more interesting for our task, since the pixel accuracy is strongly dependent on large structures, such as the floor, that are not split in the hierarchical labeling). The argmax information, instead, brings a limited contribution to the final accuracy values.
One may wonder what would happen if we trained both the first and the second stage with the same set of classes (i.e., by just using a deeper network without really exploiting the hierarchical structure of the classes). We expect this scenario to achieve almost the same results as our baseline approach, or slightly better, since we are retraining the same architecture with an additional input, i.e., the output of the previously trained network. At the same time, we expect the incremental framework to be the dominant factor for the performance increase. Indeed, the results for this stacking are perfectly in line with our intuition, as reported in
Table 2 with the name “Stacking”.
Furthermore, we performed an additional incremental step to predict the set of 40 classes starting from the prediction of the set of 13 labels. The results are reported in
Table 3. In this case, our method was outperformed by some methods in the literature, due to some inherent limitations of the employed Deeplab V3+ architecture. However, the most interesting aspect for this work is the comparison with our baseline method, i.e., Deeplab V3+ directly trained on the 40 classes, in order to appreciate the gain of the hierarchical approaches.
We can appreciate how the three additional cues produce some improvements in the various metrics, even if the gain is more limited here. The result is still noticeable if we recall that the splitting is highly unbalanced and inhomogeneous, as discussed in
Section 4.
Finally, in
Table 4, we evaluate our joint learning approach on the three sets of classes simultaneously. We can appreciate that the joint model allows not only predicting the three sets of labels at the same time, without the need for multiple training stages, but also improving the accuracy with respect to the baseline in all the scenarios and for all the metrics. The improvement, although consistent across all experiments and metrics, is sometimes modest and smaller than that of some of the previously proposed methods. The highest gains are achieved in the PA and mIoU for the set of four classes and in the PA for the set of 40 classes. It should be noticed that the chosen architecture is already highly performing, especially on the sets of 4 and 13 classes.
Figure 8 shows some qualitative results for the set of five classes. Here, we can compare the performance of our baseline Deeplab V3+ with that of the multi-tasking approach (the other methods do not apply to this setting since there is no coarser representation). As already noticed, both approaches have very high accuracy on this task. The prediction in the first column is very similar between the two approaches; however, we can verify how the multi-tasking learning outperforms the baseline in the remaining four images. In particular, look at the purple object on the top left of the image in the second column, the orange top of the
furniture on the left, or again at the purple objects on the center-right of the image. In Column 3, we can clearly see an improvement in the definition of the shapes of the sofa and of the pillows. Similarly, the
wall and the
furniture are better recognized in the last two images.
Finally, we report in
Figure 9 the qualitative results for the split of 15 classes for all the proposed approaches. From the figure, we can appreciate that the incremental approaches with the softmax or the edges generally lead to a much cleaner prediction with fewer artifacts. This can be seen particularly well in Column 1, where the chair in the center of the image is fully recovered by the proposed incremental methods, and in Column 3, where the wall is cleaned from prediction errors (in this scene, the sofa has also been properly segmented, but the wrong label of bed has been associated with it). In the same scenes, the baseline, the incremental approach with the argmax, and the multi-tasking approach suffer from some artifacts. In Column 4, we can notice that the edges information proved to be more significant than the softmax information to properly recognize the leg of the pool table.
In general, we argue that the incremental approaches with edges or softmax information are much more reliable than the conditioning based on the argmax: In the softmax case, a larger amount of information is fed as additional input, giving the network the possibility to discriminate between certain and uncertain predictions, while in the edges case the network is forced to focus more on the contours of the objects, which typically represent one of the most challenging characteristics to be learned.
Concerning computational requirements, the inference time of the modified Deeplab V3+ network is 23 ms on the workstation used for the training (with an Intel 970 CPU and an NVIDIA K40 GPU), which is roughly the same as that of the standard DeepLab V3+ architecture. Notice that incremental schemes require multiple inferences, e.g., the 13-class experiment requires executing the model twice in cascade (once for the four-class network and once for the 13-class network, which also takes as input the outcome of the four-class model). If real-time performance were needed, the best option would be the joint learning scheme proposed in
Section 3.2, which is able to perform all three tasks with a single pass through the encoder module, the most computationally demanding stage, since the decoder is very lightweight. As a comparison, current works (e.g., [
37,
38]) report higher computation times (78 ms in one case), even if a direct comparison is not possible due to the completely different hardware setups.