Dynamic Knowledge Distillation with Noise Elimination for RGB-D Salient Object Detection

RGB-D salient object detection (SOD) demonstrates its superiority in complex environments thanks to the additional depth information in the data. Inevitably, an independent stream is introduced to extract features from depth images, leading to extra computation and parameters. This methodology sacrifices model size to improve detection accuracy, which may impede the practical application of SOD. To tackle this dilemma, we propose a dynamic knowledge distillation (DKD) method, along with a lightweight structure, which significantly reduces the computational burden while maintaining validity. This method considers both teacher and student performance during the training stage and dynamically assigns the distillation weight instead of applying a fixed weight to the student model. We also investigate the issue of the RGB-D early fusion strategy in distillation and propose a simple noise elimination method to mitigate the impact of distorted training data caused by low-quality depth maps. Extensive experiments are conducted on five public datasets to demonstrate that our method can achieve competitive performance with a fast inference speed (136 FPS) compared to 12 prior methods.


Introduction
Salient object detection (SOD) aims at locating prominent objects in a given scene. In recent years, SOD has attracted significant attention and substantial progress has been demonstrated in the field. SOD can be treated as a pre-processing step that is subsequently used in diverse fields, such as image understanding [1], video detection and segmentation [2], semantic segmentation [3], object tracking [4], person re-identification [5] and others. However, due to complicated real-life scenarios, RGB-based SOD still fails to generate satisfactory prediction maps. In order to overcome this issue and obtain better detection performance in complex scenarios, depth images along with an independent network have been introduced to provide supplementary information. Specifically, Figure 1 illustrates three fusion methods in RGB-D based SOD. Existing state-of-the-art methods mainly adopt late fusion or multi-scale fusion and focus on designing feature-enhanced modules and complicated feature-fusion modules, which indeed improve the overall detection results. However, because they process a high volume of information, these models tend to become extremely complicated, which weakens the practicality of SOD using RGB-D data. In addition to the regular strategies, recent approaches [6,7] propose joint learning frameworks and treat RGB-D based SOD as a multi-task learning problem. However, these frameworks employ extra network branches and supervision labels, which causes a problem analogous to that of the aforementioned frameworks. Different from them, the early fusion in Figure 1c integrates the separate inputs into a unified representation before the feature extraction process. It provides an alternative strategy to lighten the model but suffers from the noise caused by low-quality depth information.
This motivates us to explore the potential of early fusion from a novel perspective and compress the model size for SOD while maintaining high detection accuracy. Recently, knowledge distillation (KD) has been proposed [8] to transfer knowledge from a large model to a smaller one. The main idea is that a small student model mimics a cumbersome model, the so-called teacher model, to achieve competitive performance. The cumbersome network has a larger knowledge capacity than smaller models, but this capacity may not be utilized to its full potential. In other words, a lightweight network can reach a performance similar to that of a cumbersome network through KD without increasing the number of parameters. Similar to human behaviours, this teacher-student learning process can be implemented in a simple and effective way, which forces the student model to directly learn the final prediction of the teacher model. KD has been applied in a range of machine learning applications. Zhang [9] applies KD to RGB saliency detection and proposes an efficient model by reducing the number of channels. Piao [10] explores cross-modal distillation on RGB-D data and uses an adaptive weight to distil the depth knowledge from the teacher model. Nevertheless, both adjust their student networks according to the teacher networks and distillation strategies. In addition, the adaptive distillation [10] is proposed for cross-modal distillation and only considers the performance of the teacher model, which limits the utilization of this KD method.
In order to tackle the above issues from a new perspective, we use a concise framework based on the early fusion strategy for RGB-D based SOD and propose a dynamic knowledge distillation (DKD) weight that helps the model pay more attention to hard samples by considering both teacher and student performance. We also investigate the issue of the RGB-D early fusion strategy in distillation and propose a simple noise elimination method to mitigate the impact of distorted training data caused by low-quality depth maps. Combining these two methods leads to a reasonable distillation strategy for RGB-D saliency detection. Our final model achieves a good balance between accuracy and model size on widely used benchmarks, as shown in Figure 2. In a nutshell, our main contributions can be summarised as follows:

• We propose a novel dynamic distillation strategy, which adaptively assigns the distillation weight by simultaneously considering the detection performance of the teacher and student networks during the training stage. As a result, the final model pays more attention to hard samples and improves the overall performance.
• We propose a noise elimination method that takes full advantage of the knowledge prior from the teacher network to alleviate the impact of low-quality depth maps.

Related Work
RGB-D Salient Object Detection. RGB-D based SOD has received increasing attention as a way to handle object detection tasks in complicated environments. Depth information was first introduced by [11], which models the distribution of depth-induced saliency using Gaussian mixture models. Zhao [12] proposes a feature-enhanced module and a contrast-enhanced net, which augments the contrast between the foreground and background by fluid pyramid integration. Pang [13] adopts multi-scale fusion and proposes a dynamic dilated pyramid module with adaptive receptive fields, which is generated by densely integrating cross-modal features. Chen [14] constructs a lightweight depth stream and designs a refinement network, which is progressively stacked from guided residual blocks. This method can alternately alleviate the mutual degradation and refine predictions in a progressive way. Zhou [15] leverages a novel feature aggregation network, which utilizes K-nearest neighbor graph neural networks and the non-local module to mine the geometric cues and global semantic features. Zhang [6] proposes a multi-stage cascaded learning framework and converts the joint entropy maximization problem in multi-modal learning tasks into the minimization of mutual information, which can explicitly model the complementary information between the RGB image and depth data. These previous works focus on alleviating the impact of depth maps and enhancing the feature integration through delicate modules and networks.
Knowledge Distillation. Knowledge distillation was formally introduced in [8] as a teacher-student learning framework. This method provides an effective way to compress model size and attempts to imitate the human learning mechanism. Cheng [16] designs mathematical metrics to quantify and compare learning from the teacher model and learning from raw data. They explain the superiority of KD in three aspects. First, more reliable visual concepts can be learned through KD. Second, KD enables the model to learn various concepts simultaneously. Third, learning through KD can generate more stable optimization directions in the training phase. Zheng [17] proposes a novel divide-and-conquer distillation strategy for dense object detection. They transfer the semantic and localization knowledge separately and show that the student benefits more from the original logits distillation than from feature imitation. Yang [18] explores the difference between the features of students and teachers and proposes a focal distillation to make the student focus on the teacher's critical pixels and channels. They further design a global distillation to help the student learn the relation between pixels. Xu [19] follows the human learning process and proposes a teacher-student collaborative KD. This method combines teacher-student KD and student self-distillation to enhance the performance. However, the student self-distillation model is built with multiple extra exit classifiers from deep to shallow. Recently, KD has been used in SOD tasks. Zhang [9] designs the student model by reducing the number of channels and applies multi-scale KD on the corresponding scales between teacher and student models. Piao [10] applies cross-modal distillation to RGB-D based SOD and proposes an adaptive distiller to distil the depth information, which alleviates the impact of low-quality depth maps.
Different from the aforementioned methods, our method takes the performance of both teacher and student models into consideration and generates a dynamic weight to control the regular teacher-student KD process. In addition, we analyse the depth issues in the specific RGB-D SOD task and optimize the training phase through a threshold.

Overview
Existing methodologies for RGB-D SOD tend to build two-stream networks in order to process RGB and depth features separately. This two-stream design can improve detection performance but introduces a large number of parameters, which increases the complexity and reduces the practicality of the models. The feature pyramid network (FPN) [20] is an effective structure that utilizes multi-scale features at different resolutions for detection tasks. Figure 3 illustrates the overall framework. We do not focus on designing networks and only adopt the classic FPN based on VGG16 or VGG19 [21] as the student model. In order to obtain a stronger teacher model, we employ four receptive field blocks [22] in multi-scale layers to boost the detection performance. Among the different cross-modal fusion strategies, we choose simple early fusion, which directly concatenates RGB images and depth images to form four-channel inputs. Similar to normal KD, we transfer the probability distribution of the final layer from the teacher model to the student model using the proposed DKD.
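The early fusion step described above can be illustrated with a minimal sketch. It assumes channel-first nested lists standing in for image tensors; `early_fuse` is a hypothetical helper name and not part of the original implementation.

```python
def early_fuse(rgb, depth):
    """Stack a 3-channel RGB image and a 1-channel depth map into 4 channels.

    rgb:   list of 3 channels, each a list of H rows with W floats
    depth: a single channel, a list of H rows with W floats
    (This channel-first layout is an assumption made for illustration.)
    """
    if len(rgb) != 3:
        raise ValueError("expected 3 RGB channels")
    h, w = len(depth), len(depth[0])
    for channel in rgb:
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("RGB and depth must share the same spatial size")
    # Copy the rows so the fused input does not alias the original images.
    fused = [[row[:] for row in channel] for channel in rgb]
    fused.append([row[:] for row in depth])
    return fused


# Usage: a tiny 2 x 2 image pair.
rgb = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(3)]
depth = [[0.9, 0.8], [0.7, 0.6]]
fused = early_fuse(rgb, depth)
```

A backbone that consumes such an input needs a first convolution that accepts four input channels instead of three; how the original model handles this layer is not stated in the text.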

Dynamic Knowledge Distillation
As mentioned above, KD benefits the student model, but the weight of knowledge transfer is still hand-designed. Piao [10] proposes an adaptive weight for cross-modal distillation. However, [10] only distills the depth information by considering the performance of the teacher model. In our method, we consider the performance of both the teacher and student networks and combine these two factors into a dynamic weight for KD.
Concretely, the accuracy of the teacher model represents its detection performance, which also indicates the confidence of its knowledge. Inspired by the IoU [23] used in SOD, we design a dynamic factor α_t, computed from the teacher prediction P_t and the ground truth G, to modulate the correct knowledge that can be transferred from the teacher model. In other words, α_t indicates the confidence of the knowledge that can be transferred to the student model.
We then propose another dynamic factor β_s, computed from the student prediction P_s and G, to express the degree of knowledge desired by the student model. This factor is the error rate of the current training sample, so that KD also takes the current performance of the student model into account: β_s is inversely related to the accuracy of the student output with respect to the ground truth. Hard samples, which have large error rates, therefore need to learn more from the teacher model.
Based on these two factors, we propose a simple and effective formulation for a plausible distillation weight θ_{t,s}. More specifically, we define θ_{t,s} as the weighted geometric mean of the knowledge confidence α_t from the teacher and the knowledge demand β_s from the student, where the hyper-parameter p ∈ [0, 1] balances the ratio between the teacher and student networks. It is worth noting that a large variation of θ_{t,s} leads to convergence issues in the training phase. We therefore further apply tanh as a scale function, tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), to scale θ_{t,s}.
The overall loss function is formulated from two terms with θ_{t,s} as the distillation weight: L_KL, the Kullback-Leibler divergence loss, and L_CE, the cross-entropy loss. In the final network, we set the distillation temperature in L_KL to 5 and p = 0.7.
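The equations for the two factors and the distillation weight are not reproduced in this text, so the following is only a sketch of one plausible reading. It assumes that α_t is a soft IoU between the teacher prediction and the ground truth, that β_s is one minus the IoU of the student prediction, and that p is the exponent on α_t with 1 − p on β_s. The function names `iou` and `dynamic_weight` are hypothetical.

```python
import math


def iou(pred, gt, eps=1e-8):
    """Soft IoU between two flattened maps with values in [0, 1]."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(p + g - p * g for p, g in zip(pred, gt))
    return inter / (union + eps)


def dynamic_weight(p_teacher, p_student, gt, p=0.7):
    """Distillation weight from teacher confidence and student demand.

    Assumptions (not confirmed by the text): alpha_t = IoU(P_t, G),
    beta_s = 1 - IoU(P_s, G), and the exponent p belongs to alpha_t.
    """
    alpha_t = iou(p_teacher, gt)        # confidence of the teacher's knowledge
    beta_s = 1.0 - iou(p_student, gt)   # error rate of the student
    theta = (alpha_t ** p) * (beta_s ** (1.0 - p))  # weighted geometric mean
    return math.tanh(theta)             # tanh keeps the weight in [0, 1)


# Usage: the same accurate teacher with an easy and a hard student sample.
gt = [1.0, 1.0, 0.0, 0.0]
teacher = [1.0, 1.0, 0.0, 0.0]
easy = dynamic_weight(teacher, [1.0, 1.0, 0.0, 0.0], gt)
hard = dynamic_weight(teacher, [0.0, 0.0, 1.0, 1.0], gt)
```

Under these assumptions the hard sample, which the teacher detects and the student misses, receives the larger weight, matching the behaviour described in the text.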

Noise Elimination with DKD
As mentioned above, we simplify the KD procedure and the student network architecture for the RGB-D task. Concretely, we only distill the final output distribution and abandon the depth stream by concatenating RGB and depth maps to form a four-channel input. However, this fusion strategy suffers from the noise caused by low-quality depth information. As illustrated in Figure 4, we investigate the reasons for the distortion of depth maps: (1) besides the salient object, other objects in the depth image dominate the salient features; (2) the contrast between the salient object and the background is low in the depth map; (3) the depth is distorted by the camera. Intuitively, training loss is supposed to reduce drastically if the training data are distorted. Therefore, we propose that these depth maps can be treated as noise when combined with RGB maps, and we further set an accuracy threshold during KD to control the impact of the noise: when the teacher accuracy α_t falls below the threshold, the distillation weight is replaced by a small weight, which is set to 0.01 in this paper. α_t provides a knowledge prior from the teacher network and indicates whether depth distortion has occurred. Here the threshold is set to 0.5. Under this circumstance, the student model is able to identify the useless training data when receiving knowledge from the teacher model.
Compared to considering one aspect or enforcing a fixed weight on the student model, our dynamic weight considers both the correctness of the teacher's knowledge and the error of the student network, which allows the student network to receive knowledge according to the degree of difficulty of the samples. θ_{t,s} varies little at the start of the training phase. In the late stage of training, the student network is able to handle most simple scenarios except for some hard samples. Therefore, θ_{t,s} automatically assigns relatively larger weights to hard samples that the teacher network detects accurately but the student network does not.
The noise elimination method takes full advantage of the knowledge prior from the teacher network and effectively reduces the negative impact of low-quality depth maps. Extensive experiments in Section 4 demonstrate that this DKD boosts the detection performance without increasing the number of parameters or the model size. The process of the proposed methods is summarised in Algorithm 1.
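The thresholding rule for noise elimination can be sketched as a small gate on the distillation weight. The values 0.5 and 0.01 come from the text. The strict comparison, the function name `eliminate_noise` and the parameter name `epsilon` are assumptions made for illustration.

```python
def eliminate_noise(theta_ts, alpha_t, threshold=0.5, epsilon=0.01):
    """Gate the distillation weight using the teacher accuracy as a prior.

    Assumption: a sample counts as noisy when alpha_t is strictly below the
    threshold; the strictness of the comparison is not stated in the text.
    """
    if alpha_t < threshold:
        return epsilon  # likely distorted depth map: nearly ignore the teacher
    return theta_ts     # reliable sample: keep the dynamic weight


# Usage: three samples given as (theta_ts, alpha_t) pairs.
samples = [(0.60, 0.92), (0.45, 0.31), (0.70, 0.50)]
weights = [eliminate_noise(theta, alpha) for theta, alpha in samples]
```

Only the second sample, whose teacher accuracy is below 0.5, has its weight replaced by the small constant.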

Algorithm 1 DKD
Require: P_t is the prediction of the teacher network, P_s is the prediction of the student network, G is the corresponding ground truth.
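The steps of Algorithm 1 are not reproduced in this text. As a hedged illustration of the two loss terms that it relies on, the sketch below computes a temperature-softened KL divergence and a pixel-wise cross-entropy on flattened logits, then mixes them with a given distillation weight. Several points are assumptions and not taken from the paper: the use of per-pixel Bernoulli distributions, the T² scaling (a common KD convention), and the convex combination θ · L_KL + (1 − θ) · L_CE. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clamp(x, eps=1e-7):
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(x, eps), 1.0 - eps)


def bce(student_logits, gt):
    """Pixel-wise binary cross-entropy between the student and the ground truth."""
    total = 0.0
    for z, g in zip(student_logits, gt):
        q = clamp(sigmoid(z))
        total -= g * math.log(q) + (1.0 - g) * math.log(1.0 - q)
    return total / len(gt)


def kl_with_temperature(teacher_logits, student_logits, T=5.0):
    """KL(teacher || student) on temperature-softened per-pixel distributions."""
    total = 0.0
    for zt, zs in zip(teacher_logits, student_logits):
        pt = clamp(sigmoid(zt / T))
        ps = clamp(sigmoid(zs / T))
        total += pt * math.log(pt / ps)
        total += (1.0 - pt) * math.log((1.0 - pt) / (1.0 - ps))
    # T * T rescaling is a common KD convention, assumed here.
    return (T * T) * total / len(teacher_logits)


def distillation_loss(teacher_logits, student_logits, gt, theta, T=5.0):
    """Assumed convex mix of the two terms, weighted by theta."""
    kd = kl_with_temperature(teacher_logits, student_logits, T)
    ce = bce(student_logits, gt)
    return theta * kd + (1.0 - theta) * ce


# Usage on three pixels.
t = [2.0, -1.0, 0.5]
s = [0.5, 0.3, -2.0]
g = [1.0, 0.0, 1.0]
loss = distillation_loss(t, s, g, theta=0.3)
```

With θ = 0 the sketch reduces to plain cross-entropy training, and with θ = 1 the student only follows the teacher.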

Datasets and Evaluation Metrics
Datasets. Extensive experiments are conducted on five widely used RGB-D datasets, namely, NLPR [24], NJUD [25], SIP [26], DES [27] and LFSD [28]. These datasets contain large-scale images with different resolutions and diverse scenarios. We adopt the same training dataset as [12], which contains 1500 samples from NJUD and 700 samples from NLPR. The remaining images in these two datasets, together with the other three datasets, are used for testing.
Evaluation Metrics. We adopt five metrics to comprehensively evaluate SOD tasks. These metrics include the F-measure curves, the F-measure score (F_β), the Mean Absolute Error (M), the S-measure (S_α) and the E-measure (E_θ). Specifically, F_β measures the accuracy of the model as follows:
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}},$$
where β² is set to 0.3 as default. M measures the error rate of the model as follows:
$$M = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$$
where W denotes the width and H denotes the height of the prediction, S is the predicted saliency map and G is the corresponding ground truth.
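As a rough illustration of the F-measure and the mean absolute error, the following sketch uses their standard definitions on flattened maps, so that the number of elements plays the role of W × H. The binarisation threshold, the threshold sweep used for the maximum F-measure, and all function names are assumptions made for illustration and are not the evaluation code of the paper.

```python
def mae(sal, gt):
    """Mean absolute error between a saliency map and the ground truth."""
    return sum(abs(s - g) for s, g in zip(sal, gt)) / len(gt)


def f_measure(sal, gt, threshold=0.5, beta2=0.3, eps=1e-8):
    """F-measure of the saliency map binarised at the given threshold."""
    binary = [1.0 if s >= threshold else 0.0 for s in sal]
    tp = sum(b * g for b, g in zip(binary, gt))   # true positives
    precision = tp / (sum(binary) + eps)
    recall = tp / (sum(gt) + eps)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + eps)


def max_f_measure(sal, gt, steps=255):
    """Best F-measure over evenly spaced thresholds (an assumed sweep)."""
    return max(f_measure(sal, gt, threshold=t / steps) for t in range(1, steps + 1))


# Usage on a flattened 2 x 2 prediction.
sal = [0.9, 0.8, 0.2, 0.1]
gt = [1.0, 1.0, 0.0, 0.0]
error = mae(sal, gt)
score = f_measure(sal, gt)
best = max_f_measure(sal, gt)
```

Setting β² = 0.3 weights precision more heavily than recall, which is the usual choice in SOD evaluation.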

Implementation Details
Our model is implemented using the PyTorch toolbox and trained on a GTX TITAN X GPU for 40 epochs with a mini-batch size of 4. We use a VGG16- and a VGG19-based FPN as our final student architectures. Both RGB and depth images are resized to 256 × 256. To avoid overfitting, simple flipping and rotation are adopted to augment the training dataset. The initial learning rate is set to 1 × 10^−3, and we adopt a weight decay of 0.0005 for stochastic gradient descent (SGD) with a momentum of 0.9.
Quantitative Evaluation. Table 1 shows the quantitative results over five datasets. It can be observed that our method achieves the best scores in most metrics, especially on the NJUD dataset, which contains 500 testing image pairs and on which our method performs best on all metrics. As for LFSD and SIP, although other methods achieve higher results, we still obtain competitive results with the smaller VGG16- and VGG19-based networks. Figure 5 shows the comparison results using one-dimensional curves. Our method is represented by the red line, which demonstrates better overall performance for both lightweight student models. In addition, Table 2 shows that our final VGG16-based network is only 57.9 MB, which drastically reduces the number of parameters and improves the inference speed. The above results indicate that, without designing complicated models, accurate detection results can be obtained by only using an FPN with the help of the proposed methods.
Qualitative Evaluation. Figure 6 shows visual comparisons with prevalent methods from recent years. The images contain diverse objects and scenarios and are picked from different testing datasets. It can be observed that the saliency maps generated by our method are closer to the ground truth. More specifically, row 1 shows a case where the depth image has low contrast, especially in the bottom part, and rows 2 and 3 show complex backgrounds in the RGB images. Under these circumstances, our method generates better saliency maps with less distortion and fewer irrelevant objects compared to other methods.
Table 1. Quantitative comparisons through the maximum of F-score F_β, S-score S_α, E-score E_θ, and error-score M, over five widely evaluated datasets. ↑ and ↓ indicate that larger and smaller scores are better, respectively. Ours and Ours* indicate the simple VGG16-based and VGG19-based FPN, respectively. ite refers to the training iterations.

Ablation Studies
Dynamic Knowledge Distillation. As shown in Table 3, our baseline is an FPN with a VGG19 backbone trained with a cross-entropy loss, which can achieve the basic detection task. RGB indicates that only RGB maps are used in training, while RGBD indicates that depth maps are concatenated to form the input. It is worth noting that directly using the early fusion strategy also shows potential in RGB-D saliency detection. Then, we employ KD on the baseline to compare the results of different weights on four datasets. It is observed that KD can improve the performance and that our DKD achieves better results across the four testing datasets. Furthermore, Figure 7 shows the detection performance of the student network under different KD weights. It is observed that in the last 10,000 iterations of the training stage, the proposed method has better overall accuracy, where the lowest accuracy is still above 0.4, which is even better than the teacher network. Specific examples in Figure 8 illustrate that, compared with the DKD, conventional fixed weights suffer from more false positives and negatives. Consequently, it is demonstrated that our DKD adaptively controls the KD in an appropriate way, leading to the improvement of overall detection performance.
Noise Elimination with DKD. We investigate the low accuracy issue in the teacher network by visualising extremely hard samples, which are shown in Figure 4. Table 3 shows that, with the help of noise elimination, all evaluation metrics over the four testing datasets improve. Red arrows and rectangles in Figure 9 mark the details that are refined by the proposed methods, especially the parts of objects against low-contrast backgrounds, further illustrating that the proposed noise elimination effectively mitigates the noise during distillation and enables the student model to learn more semantic details from the useful training data.
Figure 9. Ablation studies of the proposed methods (panels: RGB, Depth, GT, Baseline, DKD, DKD+NE). Baseline represents the student network trained only with the cross-entropy loss. DKD represents the proposed DKD and NE means noise elimination.

Further Analysis
In order to show the generalization of the proposed DKD, we replace the teacher network with DANet. Experimental results in Table 4 indicate that our DKD can compress the model size of an existing method and approach an accuracy similar to that of the teacher model, especially on DES, where the student model even outperforms the teacher model. Thus, the proposed dynamic distillation strategy can be explored on different teacher models. We further conduct experiments on the VGG16-based FPN with different KD hyper-parameters, as shown in Table 5. Specifically, we set the temperature to 10 and only use the RGB images in the distillation training phase. Experimental results demonstrate that the proposed DKD can be utilized on different networks with different training settings, proving the effectiveness and generalization of DKD and showing the potential of achieving RGB-D SOD tasks through RGB data within a lightweight structure.

Conclusions
In this paper, we propose a DKD strategy and a noise elimination method for RGB-D based SOD. The proposed dynamic strategy considers the performance of both teacher and student models to generate an adaptive weight for KD. In order to reduce the final model size, we adopt the early fusion strategy to fuse features from different domains and use the simple FPN as the final student model without designing extra networks. In addition, we investigate the noise issue caused by depth maps and alleviate this problem by setting a threshold during KD. The proposed methods can be applied to different teacher models and provide a new perspective that avoids designing extra networks for RGB-D SOD. We conduct comprehensive experiments on five challenging benchmark datasets to demonstrate that our method achieves competitive performance by only using a simple FPN model, which significantly compresses the model size and increases the inference speed. We further apply this dynamic strategy with different distillation temperatures and diverse models to prove the effectiveness and generalization of our method.