A Multi-Level Approach to Waste Object Segmentation

We address the problem of localizing waste objects from a color image and an optional depth image, which is a key perception component for robotic interaction with such objects. Specifically, our method integrates the intensity and depth information at multiple levels of spatial granularity. Firstly, a scene-level deep network produces an initial coarse segmentation, based on which we select a few potential object regions to zoom in and perform fine segmentation. The results of the above steps are further integrated into a densely connected conditional random field that learns to respect the appearance, depth, and spatial affinities with pixel-level accuracy. In addition, we create a new RGBD waste object segmentation dataset, MJU-Waste, that is made public to facilitate future research in this area. The efficacy of our method is validated on both MJU-Waste and the Trash Annotation in Context (TACO) dataset.


Introduction
Waste objects are commonly found in both indoor and outdoor environments such as household, office or road scenes. As such, it is important for a vision-based intelligent robot to localize and interact with them. However, detecting and segmenting waste objects are much more challenging than most other objects. For example, waste objects could either be incomplete or damaged, or both. In many cases, their presence could only be inferred from scene-level contexts, e.g., via reasoning about their contrast to the background and judging by their intended utilities. On the other hand, one key challenge to accurately localizing waste objects is the extreme scale variation resulting from the variable physical sizes and the dynamic perspectives, as shown in Figure 1. Due to the large number of small objects, it is difficult even for most humans to accurately delineate waste object boundaries without zooming in to see the appearance details clearly. For the human vision system, however, attention can either be shifted to cover a wide area of the visual field, or narrowed to a tiny region as when we scrutinize a small area for details (e.g., [1][2][3][4]). Presented with an image, we can immediately recognize the meaning of the scene and the global structure, which allow us to easily spot objects of interest. We can consequently attend to those object regions to perform fine-grained delineation. Inspired by how the human vision system works, we solve the waste object segmentation problem in a similar manner by integrating visual cues from multiple levels of spatial granularity. The general idea of exploiting objectness has long been proven effective for a wide range of vision-based applications [6][7][8][9]. In particular, several works have already demonstrated that objectness reasoning can positively impact semantic segmentation [10][11][12][13]. However, in this work we propose a simple yet effective strategy for waste object proposal that neither require pretrained objectness models nor additional object or part annotations. Our primary goal is to address the extreme scale variation which is much less common in generic objects. In order to obtain accurate and coherent segmentation results, our method performs joint inference at three levels. Firstly, we obtain a coarse segmentation at the scene-level to capture the global context and to propose potential object regions. We note that our simple object region proposal strategy captures the objectness priors reasonably well in practice. This is followed by an object-level segmentation to recover fine structural details for each object region proposal. In particular, adopting two separate models at the scene and object levels respectively allows us to disentangle the learning of the global image contexts from the learning of the fine object boundary details. Finally, we perform joint inference to integrate results from both the scene and object levels, as well as making pixel-level refinements based on color, depth, and spatial affinities. The main steps are summarized and illustrated in Figure 2. We obtain significantly superior results with our method, greatly surpassing a number of strong semantic segmentation baselines.
Recent years witnessed a huge success of deep learning in a wide spectrum of vision-based perception tasks [14][15][16][17]. In this work, we would also like to harness the powerful learning capabilities of convolutional neural network (CNN) models to address the waste object segmentation problem. Most state-of-the-art CNN-based segmentation models exploit the spatial-preserving properties of fully convolutional networks [17] to directly learn feature representations that could translate into class probability maps either at the scene level (i.e., semantic segmentation) or the object level (i.e., instance segmentation). One of the key limitations when it comes to applying these general-purpose models directly for waste object segmentation is that they are unable to handle the extreme object scale variation due to the delicate contention between global semantics and accurate localization under a fixed feature resolution, and the resulting segmentation can be inaccurate for the abundant small objects with complex shape details. Based upon this observation, we propose to learn a multi-level model that allows us to adaptively zoom into object regions to recover fine structural details, while retaining a scene-level model to capture the long-range context and to provide object proposals. Furthermore, such a layered model can be jointly reasoned with pixel-level refinements under a unified Conditional Random Field (CRF) [18]   Given an input image, we approach the waste object segmentation problem at three levels: (i) scene-level parsing for an initial coarse segmentation, (ii) object-level parsing to recover fine details for each object region proposal, and (iii) pixel-level refinement based on color, depth, and spatial affinities. Together, joint inference at all these levels produces coherent final segmentation results.
The main contributions of our work are three-fold. Firstly, we propose a deep-learning based waste object segmentation framework that integrates scene-level and object-level reasoning. In particular, our method does not require additional object-level annotations. By virtue of a simple object region proposal method, we are able to learn separate scene-level and object-level segmentation models that allow us to achieve accurate localization while preserving the strong global contextual semantics. Secondly, we develop a strategy based on densely connected CRF [19] to perform joint inference at the scene, object, and pixel levels to produce a highly accurate and coherent final segmentation. In addition to the scene and object level parsing, our CRF model further refines the segmentation results with appearance, depth, and spatial affinity pairwise terms. Importantly, this CRF model is also amenable to a filtering-based efficient inference. Finally, we collected and annotated a new RGBD [20] dataset, MJU-Waste, for waste object segmentation. We believe our dataset is the first public RGBD dataset for this task. Furthermore, we evaluate our method on the TACO dataset [5], which is another public waste object segmentation benchmark. To the best of our knowledge, our work is among the first in the literature to address waste object segmentation on public datasets. Experiments on both datasets verify that our method can be used as a general framework to improve the performance of a wide range of deep models such as FCN [17], PSPNet [21], CCNet [22] and DeepLab [23].
We note that the focus of this work is to obtain accurate waste object boundary delineation. Another closely related and also very challenging task is waste object detection and classification. Ultimately, we would like to solve for waste instance segmentation with fine-grained class information. However, existing datasets do not provide a large number of object classes with sufficient training data. In addition, differentiating waste instances under a single class label is also challenging. For example, the best Average Precision (AP) obtained in [5] are in the 20s for the TACO-1 classless litter detection task where the goal is to detect and segment litter items with a single class label. Therefore, in this paper we adopt a research methodology under which we gradually move toward richer models while maintaining a high level of performance. In this regard, we formulate our problem as a two-class (waste vs. background) semantic segmentation one. This allows us to obtain high quality segmentation results as we demonstrate with our experiments.
In the remainder of this paper, Section 2 briefly reviews the literature on waste object segmentation and related tasks, as well as recent progress in semantic segmentation. We then describe details of our method in Section 3. Afterwards, Section 4 presents findings from our experimental evaluation, followed by closing remarks in Section 5.

Waste Object Segmentation and Related Tasks
The ability to automatically detect, localize and classify waste objects is of wide interest in the computer and robotic vision community. However, there are relatively limited works in the literature that address the specific task of waste object segmentation. We believe this is partially due to the poor availability of public waste segmentation datasets until very recently. Therefore, in this paper we propose the MJU-Waste dataset with 2475 RGBD images each annotated with a pixelwise waste object mask. To facilitate future research, we make our dataset publicly available. To the best of our knowledge, this is the only public dataset of this kind in addition to the TACO dataset [5] of 1500 color images. Below we briefly review some recent works on waste object classification, detection and segmentation which are closely related ours.
Yang and Thung [24] addressed the waste classification problem and compared the performance of shallow and deep models. In their work, they collected the TrashNet dataset of 2500 images of single pieces of waste. Based on their data, Bircanoglu et al. [25] and Aral et al. [26] performed detailed comparisons among various deep architectures. Additionally, Awe et al. [27] created a synthetic dataset for waste object detection based on Faster RCNN [16]. Similarly, Chu et al. [28] proposed a hybrid CNN approach for waste classification with a dataset of 5000 waste objects. Vo et al. [29] created another dataset VN-trash with 5904 images for deep transfer learning. Furthermore, Ramalingam et al. [30] presented a debris classification model for floor-cleaning robots with a cascade CNN and an SVM. Yin et al. [31] proposed a lightweight CNN for food litter detection in table cleaning tasks. Rad et al. [32] presented an approach similar to OverFeat [33] for litter object detection. Another similar approach based on Faster RCNN [16] is presented by Wang and Zhang [34]. Contrary to the above works, we address the waste object segmentation problem that requires accurate delineation of object boundaries.
In terms of methods that involves a segmentation component, Bai et al. [35] designed a robot for picking up garbage on the grass with a two-stage perception approach. Firstly, they used SegNet [36] for ground segmentation to allow the robot to move toward waste objects. After a close-range image is acquired, ResNet [14] is used for object classification. Here the segmentation module is used for background modeling of the grassland only, hence no object segmentation is performed. Deepa et al. [37] presented a garbage coverage segmentation method in water terrain based on color transformation and K-means. In addition, Mittal et al. [38] proposed an approach based on the Fully Convolutional Network (FCN) [17] for coarse garbage segmentation. Their method is based on extracting image patches and combining their predictions, and therefore cannot capture the finer object boundary details. Zeng et al. [39] proposed a multi-scale CNN based garbage detection method from airborne hyperspectral data. In their method, a binary segmentation map is generated as the input to selective search [7] for the purpose of obtaining bounding box-based region proposals. All these above works do not address the specific task of waste object segmentation for robotic interaction.
Perhaps being the closest to our work, Zhang et al. [40] proposed an object segmentation method for waste disposal lines based on RGBD sensors. Their method begins with background subtraction on the 3D point cloud, and then attempts to find an optimal projection plane for subsequent object segmentation. Another work [41] from the same group proposed a relabeling method for ambiguous regions after the background is subtracted. Unlike their methods, we take a data-driven approach to address the problem in a much more challenging scenario. Specifically, our method does not assume a particular background model and is able to segment waste objects in both hand-held and in-the-wild scenarios.
Another work that is conceptually similar to ours is from Grard et al. [42]. They explored an interesting interactive setting for object segmentation from a cluttered background. A user is asked to click on an object to extract, and their model uses a dual-objective FCN trained on synthetic depth images to produce the object mask. In our work, however, we aim at a more challenging scenario that does not require human interaction.

Semantic Segmentation
The task of assigning a class label to every pixel in an image is a long-established fundamental problem in computer vision [43][44][45][46][47][48][49][50][51]. Since the pioneering work of Long, Shelhamer and Darrell [17], researchers proposed a large number of architectures and techniques to improve upon the FCN to capture multi-scale context [52][53][54][55][56][57][58] or to improve results at object boundaries [59][60][61][62][63][64][65][66]. Other recent works have explored encoder-decoder structures [36,67,68] and contextual dependencies based on the self-attention mechanism [22,[69][70][71]. For example, some of the recent influential works include PSPNet [21] which proposed the pyramid pooling module to combine representations from multiple scales and RefineNet [72] in which a multi-path refinement network is proposed for high-resolution semantic segmentation. Furthermore, Gated SCNN [73] proposed a two-stream structure that explicitly enhances shape prediction. We note that the performance of semantic segmentation algorithms is clearly related to the backbone architecture being used. For example, the recently proposed ResNeSt [15] model based on Split-Attention blocks provided large performance improvements to a number of vision tasks including semantic segmentation.
We note that our work is similar to DeepLab [74] whose main contributions include the atrous spatial pyramid pooling (ASPP) module to capture the multi-scale context and the use of dense CRF [19] to improve results at object boundaries. In their follow up work [23,75], they improved the ASPP module by adding image pooling and a decoder structure. Our method differs from the DeepLab series in two important aspects. Firstly, our method is tailored to the task of waste object segmentation and introduces layered deep models that perform scene-level parsing and object-level parsing respectively. Secondly, the dense CRF model in our work integrates information from both the scene and object level parsing results, as well as pixel-level affinities which can additionally encode local geometric information via input depth images. In fact, we demonstrate through our experiments that the proposed method is a general framework which can be applied in conjunction with DeepLab and other strong semantic segmentation baselines such as PSPNet [21] and CCNet [22] to improve their results by a clear margin on the waste object segmentation task.
There have been a few works that explored semantic segmentation with RGBD data [76][77][78][79]. For example, FuseNet [80] proposed a two-stream encoder that extracts features from both color and depth images in an encoder-decoder type of network. Qi et al. [81] constructed a graph neural network based on spatial affinities inferred from depth. Contrary to existing works, we use depth affinities as a means to refining waste object segmentation results. Our use of depth information is flexible in that the model can cope with situations where the depth modality is present or absent without re-training.
Lastly, there have been a few works that tackle objectness aware semantic segmentation [10,11]. Apart from addressing the problem in the novel waste object segmentation domain, our method uses a simple yet effective strategy for object region proposal that does not require any additional object or part annotations.

Our Approach
In this section, let us formally introduce the waste object segmentation problem and the proposed approach. We begin with the definition of the problem and notations. Given an input color image and optionally an additional depth image, our model outputs a pixelwise labeling map, as shown in Figure 2. Mathematically, denote the input color image as I ∈ R H×W×3 , the optional depth image as D ∈ R H×W , and the semantic label set as C = {1, 2, . . . , C}, where C is the number of classes. Our goal is to produce a structured semantic labeling x ∈ C H×W . We note that in deep models, the labeling of x at image coordinate (i, j), x ij , is usually obtained via multi-class softmax scores on a spatial-preserving convolutional feature map C ∈ R H×W×C , i.e., . In practice, it is common that the convolutional feature map C is downsampled w.r.t. the original image resolution, but we can always assume that the resolution can be restored with interpolation.

Layered Deep Models
In this work, we apply deep models at both the scene and the object levels. For this purpose, let us define a number of image regions in which we obtain deeply trained feature representations. Firstly, let R 0 = {(i, j) i∈{1...H},j ∈{1...W} } be the set of all spatial coordinates on the image plane, or the entire image region. This is the region in which we perform scene-level parsing (i.e., coarse segmentation). In addition, we perform object-level parsing (i.e., fine segmentation) on a set of non-overlapping object region proposals. We denote each of these additional regions as Details on generating these regions are discussed in Section 3.3. We apply our coarse segmentation feature embedding network F c and the fine segmentation feature embedding network F f to the appropriate image regions as follows: where R l ⇒ I denotes cropping the region R l from image I. Here C 0 ∈ R H×W×C and C l ∈ R H l ×W l ×C , and we note that these feature maps are upsampled where necessary. In addition, the spatial dimension may be image and region specific for both R 0 and R l , which poses a practical problem for batch-based training. To address this issue, during CNN training we resize all image regions so that they have a common shorter side length, followed by randomly cropping a fixed-sized patch as part of the data augmentation procedure. We refer the readers to Section 4.2 for details. In Figure 3, the processes shown in blue and yellow illustrate the steps described in this section.

Coherent Segmentation with CRF
Given the layered deep models, we now introduce our graphical model for predicting coherent waste object segmentation results. Specifically, the overall energy function of our CRF model consists of three main components: where Φ c (x; I) represents the scene-level coarse segmentation potentials, Φ f (x; I) denotes the object-level fine segmentation potentials, and Ψ(x; I, D) is the pairwise potentials that respect the color, depth, and spatial affinities in the input images. α is the weight for the relative importance among the two unary terms. The graphical representation of our CRF model is shown in Figure 3. We describe the details of these three terms below.
Scene-level unary term. The scene-level coarse segmentation unary term is given by where P c (x ij ; I) is a pixelwise softmax on the feature map C 0 as follows: where · denotes the indicator function. This term produces a coarse segmentation map based on the long-range contexts from the input image. Importantly, we use the output of this term to generate our object region proposals, as discussed in Section 3.3. Object-level unary term. The object-level fine segmentation unary term is given by The formulation above states that if a pixel location (i, j) belongs to one of the L object region proposals, a negative log-probability obtained via fine segmentation is adopted. Otherwise, the object-level unary term falls back to the scene-level unary potentials. Here the probability P f (x ij , l; I) given by the fine segmentation model is obtained as follows: where T 1 l (·) and T 2 l (·) are translation functions that map the image coordinates to that of the l-th object proposal region, and C l is the output feature embedding from the fine segmentation model for the l-th object proposal region. We note that the object-level unary potentials typically recover more fine details along object boundaries, as opposed to the scene-level unary potentials. In general, it would become too computationally expensive to compute scene-level potentials at a comparable resolution for the entire image. In most cases, computing the object-level unary term on less than 3 object region proposals are sufficient, see Section 3.3 for details. Additionally, our object-level potentials are obtained via a separate deep model that allows us to decouple the learning of long-range contexts from the learning of fine structural details. Pixel-level pairwise term. Although the object-level unary potentials provide finer segmentation details, accurate boundary details could still be lost for some irregularly shaped waste objects. This poses a practical challenge for detail-preserving global inference.
Following [19], we address this challenge by introducing a pairwise term that is a linear combination of Gaussian kernels in a joint feature space that includes color, depth, and spatial coordinates. This allows us to produce coherent object segmentation results that respect the appearance, depth, and spatial affinities in the original image resolution. More importantly, this form of the pairwise term allows for efficient global inference [19]. Specifically, our pairwise term Ψ(x; I, D) includes an appearance term ψ a (x ij , x uv ; I), a spatial smoothing term ψ s (x ij , x uv ) and a depth term ψ d (x ij , x uv ; D): where x ij = x uv is the Potts label compatibility function. The appearance term and the smoothing term follow [19] and take the following form: where I ij and p ij are the image appearance and position features at the pixel location (i, j). In addition, when a input depth image D is available, we are able to enforce an additional pairwise term induced by geometric affinities: where D ij is the depth reading at the pixel location (i, j). We note that in practice, any missing values in D are filled in with a median filter [82] beforehand, see Section 4.1 for details. In addition, Equation (9) can be conveniently added or removed depending on the depth data availability. We simply discard Equation (9) when training models for the TACO dataset which only contains color images.

Generating Object Region Proposals
In this work, we follow a simple strategy to generate object region proposals R l = {(i, j) i∈{1...H l },j ∈{1...W l } }, 1 ≤ l ≤ L. In particular, the output from the scene-level coarse segmentation model is a good indication of the waste object locations. See Figure 3 for an example. We begin with extracting the connected components in the foreground class labelings of x ∈ C H×W from the maximum a posterior (MAP) estimate of the scene-level unary term Φ c (x; I). For each connected component, a tight bounding box R t l is extracted. This is followed by extending R t l by 30% in four directions (i.e., N,S,W,E), subject to the image boundary truncation. Finally, we merge overlapping regions and remove those below or above certain size thresholds (details in Section 4.2) to obtain a concise set of final object region proposals R l , 1 ≤ l ≤ L. Example object region proposals obtained using this procedure are shown in Figure 4, and we note that any similar implementation should also work satisfactorily.
Most images from the MJU-Waste dataset contain only one hand-held waste object per image. For DeepLabv3 with a ResNet-50 backbone, for example, only 2.4% of all images from MJU-Waste produce 2 or more object proposals. For the TACO dataset, 24.0% of all images produce 2 or more object proposals. However, only 8.1% and 0.6% of all images produce more than 3 and 5 object proposals, respectively.

Model Inference
Following [19], we use the mean field approximation of the joint probability distribution P(x) that computes a factorized distribution Q(x) which minimizes the KL-divergence [83,84] KL(Q||P). For our model, this yields the following message passing-based iterative update equation: where the input color image I and the depth image D are omitted for notation simplicity. In practice, we use the efficient message passing algorithm proposed in [19]. The number of iterations is set to 10 in all experiments.

Model Learning
Let us now move on to discuss details pertaining to the learning of our model. Specifically, we learn the parameters of our model by piecewise training. First, the coarse segmentation feature embedding network F c is trained with standard cross-entropy (CE) loss on the predicted coarse segmentation. Based on the coarse segmentation for the training images, we extract object region proposals with the method discussed in Section 3.3. This allows us to then train the fine segmentation feature embedding network F f using the cropped object regions in a similar manner. Next, we learn the weight and the kernel parameters of our CRF model. We initialize them to the default values used in [19] and then use grid search to finetune their values on a held-out validation set. We note that our model is not too sensitive to most of the parameters. On each dataset, we use fixed values of these parameters for all CNN architectures. See Section 4.2 for details.

Experimental Evaluation
In this section, we compare the proposed method with state-of-the-art semantic segmentation baselines. We focus on two challenging scenarios for waste object localization: the hand-held setting (for applications such as service robot interactions or smart trash bins) and waste objects "in the wild".
In our experiments, we found that one of the common challenges for both scenarios is the extreme scale variation causing standard segmentation algorithms to underperform. Our proposed method, however, greatly improves the segmentation performance in these adverse scenarios. Specifically, we evaluate our method on the following two datasets: • MJU-Waste Dataset. In this work, we created a new benchmark for waste object segmentation. The dataset is available from https://github.com/realwecan/mju-waste/. To the best of our knowledge, MJU-Waste is the largest public benchmark available for waste object segmentation, with 1485 images for training, 248 for validation and 742 for testing. For each color image, we provide the co-registered depth image captured using an RGBD camera. We manually labeled each of the image. More details about our dataset are presented in Section 4. We summarize the key statistics of the two datasets in Table 1. Once again, we emphasize that one of the key characteristics of waste objects is that the number of objects per class can be highly imbalanced (e.g., in the case of TACO [5]). In order to obtain sufficient data to train a strong segmentation algorithm, we use a single class label for all waste objects, and our problem is therefore defined as a binary pixelwise prediction one (i.e., waste vs. background). For the quantitative evaluation that follows, we report the performance of baseline methods and the proposed method by four criteria: Intersection over Union (IoU) for the waste object class, mean IoU (mIoU), pixel Precision (Prec) for the waste object class, and Mean pixel precision (Mean). Let TP, FP and FN denote the total number of true positive, false positive and false negative pixels, respectively. The four criteria used are defined as follows: • Intersection over Union (IoU) for the c-th class is the intersection of the prediction and ground-truth regions of the c-th class over the union of them, defined as: • mean IoU (mIoU) is the average IoU of all C classes: • Pixel Precision (Prec) for the c-th class is the percentage of correctly classified pixels of all predictions of the c-th class: • Mean pixel precision (Mean) is the average class-wise pixel precision: We note that the image labelings are typically dominated by the background class, therefore IoU and Prec reported on the waste objects only are more sensitive than mIoU and Mean which consider both waste objects and the background.

The MJU-Waste Dataset
Before we move on to report our findings from the experiments, let us more formally introduce the MJU-Waste dataset. We created this dataset by collecting waste items from a university campus, bringing them back to a lab, and then take pictures of people holding waste items in their hands. All images in the dataset are captured using a Microsoft Kinect RGBD camera [20]. The current version of our dataset, MJU-Waste V1, contains 2475 co-registered RGB and depth image pairs. Specifically, we randomly split the images into a training set, a validation set and a test set of 1485, 248 and 742 images, respectively.
Due to sensor limitations, the depth frames contain missing values at reflective surfaces, occlusion boundaries, and distant regions. We use a median filter [82] to fill in the missing values in order to obtain high quality depth images. Each image in MJU-Waste is annotated with a pixelwise mask of waste objects. Example color frames, ground-truth annotations, and depth frames are shown in Figure 5. In addition to semantic segmentation ground-truths, object instance masks are also available.

Implementation Details
Here we report the key implementation details of our experiments, as follows: • Segmentation networks F c and F f . Following [21,74], we use the polynomial learning rate policy with the initial learning rate set to 0.001 and the power factor set to 0.9. The total number of iterations are set to 50 epochs on both datasets with a batch size of 4 images. In all experiments, we use the ImageNet-pretrained backbones [85] and a standard SGD optimizer with momentum and weight decay factors set to 0.9 and 0.0001, respectively. To avoid overfitting, standard data augmentation techniques including random mirroring, resizing (with a resize factor between 0.5 and 2), cropping and random Gaussian blur [21] are used. The base (and cropped) image sizes for F c and F f are set to 520(480) and 260(240) pixels during training, respectively. • Object region proposals. To maintain a concise set of object region proposals, we empirically set the minimum and maximum number of pixels N min and N max in an object region proposal. For MJU-Waste, N min and N max are set to 900 and 40,000, respectively. For TACO, N min and N max are set to 25,000 and 250,000, due to the larger image sizes. Object region proposals that are either too small or too big will simply be discarded. • CRF parameters. We initialize the CRF parameters with the default values in [19] and follow a simple grid search strategy to find the optimal values of CRF parameters in each term. For reference, the CRF parameters used in our experiments are listed in Table 2. We note that our model is somewhat robust to the exact values of these parameters, and for each dataset we use the same parameters for all segmentation models. • Training details and codes. In this work, we use a publicly available implementation to train the segmentation networks F c and F f . The CNN training codes are available from: https:// github.com/Tramac/awesome-semantic-segmentation-pytorch/. We use the default training settings unless otherwise specified earlier in this section. For CRF inference we use another public implementation. The CRF inference codes are available from: https://github.com/lucasb-eyer/ pydensecrf/. The complete set of CRF parameters are summarized in Table 2.

Results on the MJU-Waste Dataset
The quantitative performance evaluation results we obtained on the test set of MJU-Waste are summarized in Table 3. Methods using our proposed multi-level model have "ML" in their names. For this dataset, we report the performance of the following baseline methods: • FCN-8s [17]. FCN is a seminal work in CNN-based semantic segmentation. In particular, FCN proposes to transform fully connected layers into convolutional layers that enables a classification net to output a probabilistic heatmap of object layouts. In our experiments, we use the network architecture as proposed in [17], which adopts a VGG16 [86] backbone. In terms of the skip connections, we choose the FCN-8s variant as it retains more precise location information by fusing features from the early pool3 and pool4 layers. • PSPNet [21]. PSPNet proposes the pyramid pooling module for multi-scale context aggregation. Specifically, we choose the ResNet-101 [14] backbone variant for a good tradeoff between model complexity and performance. The pyramid pooling module concatenates the features from the last layer of the conv4 block with the same features applied with 1 × 1, 2 × 2, 3 × 3 and 6 × 6 average pooling and upsampling to harvest multi-scale contexts. • CCNet [22]. CCNet presents an attention-based context aggregation method for semantic segmentation. We also choose the ResNet-101 backbone for this method. Therefore, the overall architecture is similar to PSPNet except that we use the Recurrent Criss Cross Attention  We refer interested readers to the public implementation discussed in Section 4.2 for the network details of the above baselines. For each baseline method, we additionally implement our proposed multi-level modules and then present a direct performance comparison in terms of IoU, mIoU, Prec and Mean improvements. We show that our method provides a general framework under which a number of strong semantic segmentation baselines could be further improved. For example, FCN-8s benefits the most from a multi-level approach (i.e., +7.01 points of IoU improvement), partially due to the relatively low baseline performance. Even for the best-performing baseline, DeepLabv3 with a ResNet-101 backbone, our multi-level model further improves its performance by +3.73 IoU points. We note that such a large quantitative improvement can also be visually significant. In Figure 6, we present qualitative comparisons between FCN-8s, DeepLabv3 and their multi-level counterparts. It is clear that our approach helps to remove false positives in some non-object regions. More importantly, it is evident that multi-level models more precisely follow object boundaries.
In Table 4, we additionally perform ablation studies on the validation set of MJU-Waste. Specifically, we compare the performance of the following variants of our method: • Baseline. DeepLabv3 baseline with a ResNet-50 backbone. • Object only. The above baseline with additional object-level reasoning. This method is implemented by retaining only the two unary terms of Equation (2). All pixel-level pairwise terms are turned off. This will test if the object-level reasoning will contribute to the baseline performance. • Object and appearance. The baseline with object-level reasoning plus the appearance and the spatial smoothing pairwise terms. The depth pairwise terms are turned off. This will test if the additional pixel affinity information (without depth, however) is useful. It also verifies the efficacy of the depth pairwise terms. • Appearance and depth. The baseline with all pixel-level pairwise terms but without the object-level unary term. This will test if an object-level fine segmentation network is necessary, as well as the performance contribution of the pixel-level pairwise terms alone.  Results are clear that the full model performs the best, producing superior performance by all four criteria. This validates that the various components proposed in our method all positively impact the final results.
In terms of the computational efficiency, we report a breakdown of the average per-image inference time in Table 5. The baseline method corresponds to the scene-level inference only; additional object and pixel level inference incurs extra computational costs. These runtime statistics are obtained with an i9 desktop CPU and a single RTX 2080Ti GPU. Our full model with DeepLabv3 and ResNet-50 runs at approximately 0.8 second per image. Specifically, the computational costs for object-level inference are mainly a result from the object region proposals and the forward pass of the object region CNN. The pixel-level inference time, on the other hand, is mostly the result from the iterative mean-field approximation. It should be noted that the inference times reported here are obtained based on public implementations as mentioned in Section 4.2, without any specific optimization.
More example results obtained on the test set of MJU-Waste with our full model are shown in Figure 7. Although the images in MJU-Waste are captured indoors so that the illumination variations are less significant, there are large variations in the clothing colors and, in some cases, the color contrast between the waste objects and the clothes is small. In addition, the orientation of the objects also exhibits large variations. For example, the objects can be held with either one or both hands. During the data collection, we simply ask the participants to hold objects however they like. Despite these challenges, our model is able to reliably recover the fine boundary details in most cases.

Results on the TACO Dataset
We additionally evaluate the performance of our method on the TACO dataset. TACO contains color images only, so we exclude Equation (9) for training and evaluating models on this dataset. This dataset presents a unique challenge for localizing waste objects "in-the-wild". In general, TACO is different to MJU-Waste in two important aspects. Firstly, multiple waste objects with extreme scale variation are more common (see Figures 8 and 9 for examples). Secondly, unlike MJU-Waste the backgrounds are diverse, such as in road, grassland and beach scenes. Quantitative results obtained on the TACO test set are summarized in Table 6. Specifically, we compare our multi-level model against two baselines: FCN-8s [17] and DeepLabv3 [23]. Again, in both cases our multi-level model is able to improve the baseline performance by a clear margin. Qualitative comparisons of the segmentation results are presented in Figure 8. It is clear that our multi-level method is able to more closely follow object boundaries. More example segmentation results are presented in Figure 9. We note that the changes in illumination and orientation are generally greater on TACO than on MJU-Waste, due to the fact that there are many outdoor images. Particularly, in some beach images it is very challenging to spot waste objects due to the poor illumination and the weak color contrast. Furthermore, object scale and orientation vary greatly as a result of different camera perspectives. Again, our model is able to detect and segment waste objects with high accuracy in most images, demonstrating the efficacy of the proposed method.

Conclusions
We presented a multi-level approach to waste object localization. Specifically, our method integrates the appearance and the depth information from three levels of spatial granularity: (1) A scene-level segmentation network captures the long-range spatial contexts and produces an initial coarse segmentation. (2) Based on the coarse segmentation, we select a few potential object regions and then perform object-level segmentation. (3) The scene and object level results are then integrated into a pixel-level fully connected conditional random field to produce a coherent final localization. The superiority of our method is validated on two public datasets for waste object segmentation. As part of our work, we collected the MJU-Waste dataset that is made publicly available to facilitate future research in this area. We hope that our method could serve as a modest attempt to induce further exploration into vision-based perception of waste objects in complex real-world scenarios. For example, possible future work may explore the training of robust segmentation models that work on multiple datasets with large object appearance and camera perspective variations.