Bi-Directional Pyramid Network for Edge Detection

Abstract: Multi-scale representation plays a critical role in the field of edge detection. However, most of the existing research focuses on only one of two aspects: fast training or accurate testing. In this paper, we propose a novel multi-scale method to strike a balance between them. Specifically, following multi-stream structures and the image pyramid principle, we construct a down-sampling pyramid network and a lightweight up-sampling pyramid network to enrich the multi-scale representation of the encoder and decoder, respectively. These two pyramid networks and a backbone network constitute our overall architecture, a bi-directional pyramid network (BDP-Net). Extensive experiments show that, compared with the state-of-the-art model, our method roughly doubles the training speed while retaining a similar test accuracy. In particular, under the single-scale test, our approach also reaches human-level perception (F1 score of 0.803) on the BSDS500 database.


Introduction
Edge detection aims to extract perceptually salient edges and object boundaries from natural images. As a fundamental problem of image processing, edge detection has a significant impact on high-level feature extraction, feature description, and image understanding. Thus, it is closely related to many computer vision problems, including object recognition [1], image segmentation [2,3], and medical imaging [4,5].
Multi-scale representation learning is a long-standing topic in the field of computer vision, including edge detection. Edge detection was initially considered a low-level task until [2] applied local cues at multiple scales to attain the state of the art at that time. In recent years, with the success of deep convolutional neural networks in computer vision, including object detection [6,7] and semantic segmentation [8][9][10], several multi-scale learning methods that fuse multiple kinds of hierarchical features for edge detection have been developed. For example, HED [11] utilizes deeply supervised nets [12] to supervise the outputs of different network stages. Building on this, RCF [13] makes further efforts to enrich the features of different levels within the same network stage. Nevertheless, [11,13] do not fully exploit the contextual information of objects in a natural image, since constraints on neighboring pixel labels are not directly enforced in their encoder architectures. On the basis of RCF, BDCN [14] attempts to resolve this issue with a pseudo bi-directional cascade structure (BDC) and a scale enhancement module (SEM). As SEM can capture rich spatial contexts through multi-stream learning in the same network stage, it plays a major role in making BDCN state-of-the-art. Specifically, this module is implemented by multiple dilated convolutions with various sampling rates. However, SEM has to be used in every network stage, so it introduces additional parameters and significantly increases training time.
To specify the performance of different methods, namely test accuracy and training speed, we show some relevant results in Table 1. Note that RCF [13] outperforms its predecessors (including HED [11]) in terms of model performance, so we do not report their results here. From Table 1, one may find that RCF and BDCN each focus on one of the two aspects: fast training and accurate testing, respectively. In order to take both factors into account, we propose a novel multi-scale learning method for edge detection in natural images. To describe our architecture clearly, we start from the encoder and the decoder, respectively.

Table 1. Single-scale test performance of different networks with 15 K training iterations on BSDS500. Here, '-' represents 'not used'. The metrics ODS and OIS refer to the F1 score based on the optimal dataset scale and the optimal image scale, respectively. Note that 'para' refers to the model parameters and '+' means the additional parameters on the basis of RCF [13]. 'GFLOPS' indicates giga floating-point operations per second. 'TT' and 'FPS' represent the training time of the model and the average frames per second during training, respectively.

First, we reconsider how to strengthen the encoder. Besides multi-stream learning within the same network stage, as in SEM, multi-stream learning across network stages often appears in computer vision, such as in [9,15]. However, this usually incurs heavy computation because of frequent up-sampling in the encoder: for a given image, the computational cost of up-sampling is significantly higher than that of down-sampling. Intuitively, a feasible alternative is to apply down-sampling to multi-stream learning across network stages. Meanwhile, the number of down-sampling operations should not exceed a certain threshold, such as the total number of network stages. Based on these ideas, we propose a simple down-sampling pyramid network (DSP).
Specifically, following the image pyramid principle, we transform the original input image into a lower-resolution copy according to the stage-index difference between the current network stage and the first stage, and then fuse this down-sampled image with the corresponding features from the backbone to further strengthen the encoder.

Moreover, we consider how to enhance the decoder architecture. Instead of simply using channel concatenation as RCF [13] does, inspired by U-Net [16] we propose an up-sampling pyramid network (USP) to fuse information from different network stages. In particular, we restore the resolutions of the feature maps gradually, and the hierarchical features in our USP are separately fused with the corresponding ones in the encoder. To keep the USP lightweight, we compress the feature maps in the encoder with a series of 1 × 1 convolutions. Finally, these three components (our DSP, the lightweight USP, and a trimmed VGG16 as the backbone) make up our bi-directional pyramid network (BDP-Net).
To sum up, our contributions are three-fold:
• Firstly, a down-sampling pyramid network is proposed to enrich the multi-scale representation of the encoder. To our knowledge, this is the first time a down-sampling pyramid network has been used to enrich multi-scale features in the field of edge detection.
• Secondly, a lightweight up-sampling pyramid network is proposed to enhance the multi-scale representation of the decoder. Combining these two pyramid networks with a trimmed VGG16 yields our bi-directional pyramid network (BDP-Net).
• Last but not least, while matching the test accuracy of BDCN [14], the state-of-the-art model in the field of edge detection, the proposed BDP-Net is experimentally demonstrated to roughly double the training speed.
The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces our method and then Section 4 experimentally demonstrates its effectiveness and compares it with some state-of-the-art approaches. Finally, Section 5 provides a summary of the paper.

Related Work
Since our work is mainly related to multi-scale representation learning as well as edge detection, we give a brief review of these in the following.

Edge Detection
Edge detection is a long-standing topic dating back to work [17] in 1959. Since then, extensive literature has emerged about edge detection. Generally, one may divide these approaches into three categories: edge differential operators, traditional machine learning methods, and systems based on deep learning.
Edge differential operators are early pioneering methods (e.g., Robinson [18] and Sobel [19]), which mainly depend on color and intensity information; specifically, they rely on first- or second-derivative information. By introducing the idea of non-maximum suppression to edge detection, Canny [20] is widely used across various tasks and has become representative of such methods. However, since these methods rely on manually designed edge detection operators instead of learnable patterns, they usually achieve poor accuracy on test data.
The traditional machine learning approaches, such as [2,[21][22][23][24][25], generally apply data-driven supervised learning on top of manual features. For instance, Martin et al. [21] formulate low-level features of color, brightness, and texture in natural images, and then combine this information with the ground truth to train a classifier. Dollár et al. [24] make full use of predictions of local edge masks based on structured learning. Sam et al. [25] propose an efficient model, the oriented edge forest, by combining a random forest classifier with a clustering method based on local edge orientations. Nevertheless, the edge information extracted by these methods remains relatively limited, even with sophisticated feature engineering.
Due to their emphasis on learning features automatically, methods based on deep learning, especially convolutional neural networks (CNNs), have become the current mainstream in the field of edge detection. The early deep learning approaches, such as DeepContour [26], DeepEdge [27], and N4-Fields [28], mainly utilize patch-to-patch or patch-to-pixel strategies, which generally limit prediction effectiveness. HED [11] first implements image-to-image prediction by leveraging fully convolutional neural networks. RCF [13] makes further efforts to enrich the hierarchical features with numerous 1 × 1 convolutions. BDCN [14] relies on a bi-directional pseudo-cascade structure and a scale enhancement module to push forward the state of the art. In addition, LPCB [29] proposes a bottom-up/top-down architecture with a combination of cross-entropy and the Dice loss to achieve good predictions for edge detection. However, as mentioned in Section 1, these methods are either insufficient in context information or inefficient in training.

Existing Multi-Scale Learning
Extracting and fusing multi-scale features is an enduring topic in computer vision, including edge detection. According to where they act in a deep learning architecture, multi-scale methods can be divided roughly into two groups: those outside the network and those within the network.
The representative methods outside the network apply multi-scale inputs to a shared network or multiple networks and then fuse the multi-scale outputs. For example, Arbeláez et al. [30] begin with a multi-resolution image pyramid to perform multi-scale combinatorial grouping. Clément et al. [31] transform the original images through a Laplacian pyramid into multi-scale images that are fed to different networks. In addition to training, multi-scale inputs could be used for testing. For instance, Liu et al. [13] resize an original image to three resolutions, and then these are separately fed to the same network to obtain three outputs. Finally, these outputs are first restored to the original size and then integrated to achieve high-quality edge maps through a simple average. Although the multi-scale input method can improve the test accuracy, it also increases model training or testing time.
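The multi-scale testing procedure described for RCF can be outlined as follows. This is an illustrative sketch rather than the authors' code: `predict` is a placeholder for a trained edge detector, and nearest-neighbour resizing stands in for the bilinear resizing typically used.

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize for a 2-D map (a simple stand-in for
    the bilinear resizing used in practice)."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

def multiscale_average(image, predict, scales=(0.5, 1.0, 1.5)):
    """Feed rescaled copies of `image` to the same network, restore each
    edge map to the original resolution, and average the results."""
    h, w = image.shape
    maps = []
    for s in scales:
        sh, sw = max(1, int(h * s)), max(1, int(w * s))
        scaled = resize_nn(image, sh, sw)     # multi-scale input
        pred = predict(scaled)                # shared network
        maps.append(resize_nn(pred, h, w))    # restore original size
    return np.mean(maps, axis=0)
```

As the text notes, each extra scale adds a full forward pass, which is why multi-scale input methods trade test-time cost for accuracy.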
From the perspective of network stages, the multi-scale methods used within the network (see Figure 1) can be further divided into two groups, i.e., those used in the same stage and those applied in different stages. In view of their different forms, we divide the multi-scale approaches used in the same network stage into two categories:
(a) Multi-level learning. Within a specific network, convolutions of different depths at the same stage have various receptive fields and thus learn features of different levels. A typical example is RCF [13]. However, as mentioned in Section 1, RCF underutilizes the contextual information of objects in a natural image.
(b) Modular structure. Due to strong scalability, a large number of relevant methods have emerged in recent years, such as the Inception Series [32][33][34], Pyramid Pooling Module [35], ASPP [36], and SEM [14]. In addition to such parallel structures, some series structures like dense blocks [37] have been introduced. Generally, the modular structure could obtain good test results at the cost of increasing the model parameters and massive computation during training.
In relation to multi-scale methods applied in different network stages, we divide these into the following categories:
(c) Multi-scale formed by different network stages. Generally, a deeper network stage corresponds to a larger receptive field. For instance, Xie et al. [11] leverage side-output layers to learn nested multi-scale features for edge detection. Lin et al. [6] introduce multi-scale outputs from different stages in an up-sampling feature pyramid. Due to a lack of sufficient context information in their encoders, their test accuracy still leaves room for improvement.
(d) Encoder-decoder combined with skip-layers. For instance, U-Net [16] makes full use of the same level information from the encoder and decoder architecture. However, its decoder and encoder are almost the same in structure, which results in redundant parameters and calculations.
(e) Multi-scale across network stages. Hou et al. [15] and Zhang et al. [9] fuse information derived from the current stage and all the deeper stages and then provide a performance improvement. However, due to frequent up-sampling, their methods usually consume magnitude computations.
(f) Our architecture (BDP-Net). A down-sampling pyramid network and a lightweight up-sampling pyramid network are introduced. Contrary to [9,15], we only up-sample by a factor of two at almost every network stage, which is made possible by down-sampling the original input. As a result, the proposed architecture cuts down many calculations and speeds up training. Next, we introduce our method in detail.

Methodology
In this section, we introduce our architecture, which is based on VGG16 [38]. Compared with the raw VGG16, we not only remove all the fully connected layers and the 5th pooling operation, but also modify the encoder and decoder as follows.

The Proposed Down-Sampling Pyramid Network
To make full use of contextual information in natural images, inspired by the multi-scale learning approaches [9,15] across network stages, we introduce the down-sampling pyramid network (DSP) based on the image pyramid principle. Specifically, we propose bypass networks to down-sample the original image, and these parallel branches constitute our DSP, as shown on the left of Figure 2. Note that one may implement the DSP with 1 × 1 convolutions of different strides, or by combining 1 × 1 convolutions with stride 1 and bilinear (or bicubic) interpolation. The down-sampling stride is determined by the index difference between the current stage and the first stage. (Note that we adopt 1 as the stride of the 4th pooling, as in Liu et al. [13]. Thus, the stride of the last 1 × 1 convolution in the down-sampling pyramid network is 8, not 16. In addition, the last information fusion happens before the 4th pooling operation.) Next, the current down-sampled image is fused with the output of the previous stage in the backbone. Finally, the fused information is fed to the next stage of our encoder. For example, the inputs of the third stage in our encoder contain the output of the second stage and the feature map obtained by down-sampling the original image by a factor of 2^(3−1) = 4.
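The stride schedule above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: strided slicing stands in for a strided 1 × 1 convolution (or interpolation followed by a 1 × 1 convolution), and the stride cap at 8 reflects the stride-1 4th pooling.

```python
import numpy as np

def dsp_stride(stage):
    """Down-sampling stride for the DSP bypass feeding stage `stage`
    (1-indexed): 2 ** (stage - 1), capped at 8 because the 4th pooling
    uses stride 1, so the last bypass has stride 8 rather than 16."""
    return min(2 ** (stage - 1), 8)

def dsp_branch(image, stage):
    """Down-sample the original image for fusion at stage `stage`.
    Strided slicing is a stand-in for a strided 1x1 convolution."""
    s = dsp_stride(stage)
    return image[::s, ::s]
```

For instance, the bypass feeding stage 3 down-samples the input by a factor of 4, matching the 2^(3−1) example in the text.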
Further, the modified encoder is composed of this DSP and the previous encoder. Compared with the state-of-the-art model BDCN [14], our proposed DSP introduces no additional hyper-parameters. (Specifically, BDCN [14] needs to tune two kinds of hyper-parameters, namely the number and the sampling rates of its dilated convolutions.)

Figure 2. The framework of our BDP-Net. The left and right paths in orange correspond to our down-sampling pyramid network (DSP) and up-sampling pyramid network (USP), respectively. Here, '3 × 3 − 64' represents a convolution with kernel 3 and 64 channels, and '2 × 2 pool' means max-pooling with stride 2 in height and width. 's = 2' indicates that the down-sampling stride is 2. '⊕' means information fusion. 'deconv × 2' and 'BI × 2' refer to two-fold up-sampling using deconvolution and bilinear interpolation, respectively. The rest are similar. In addition, 'sigmoid/loss' indicates using a sigmoid layer to obtain a prediction and then calculating the corresponding loss.

The Proposed Lightweight Up-Sampling Pyramid Network
In the field of edge detection, RCF [13] makes use of abundant 1 × 1 convolutions to compress channel information and obtains rich hierarchical features through five side-outputs and a fusion output. For the fusion output, RCF directly restores the resolutions of feature maps from different stages to the original input size and then applies one channel concatenation and one convolution operation to obtain the output. Evidently, such a fusion output is too simple to capture rich information. BDCN [14] has the same problem, because it utilizes a fusion output similar to RCF's.
Motivated by U-Net [16], we propose the lightweight up-sampling pyramid network (USP) to enrich multi-scale features in the decoder. Specifically, we apply two-fold (and, where needed, four-fold) up-sampling to the feature map in the current phase (stages 2, 3, and 4) and then fuse it with the information from the corresponding stage in our encoder, as shown on the right of Figure 2. Note that the fusion is element-wise addition between feature maps rather than concatenation. Hence, the channel count in our decoder stays the same (i.e., 1), as shown in Block_2 of Figure 2, which leads to a lightweight decoder.
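The additive fusion above can be sketched as follows. This is a simplified illustration, not the authors' code: nearest-neighbour up-sampling stands in for deconvolution or bilinear interpolation, and it assumes consecutive single-channel side maps differ in resolution by exactly a factor of two (which does not hold between stages 4 and 5 of the actual network, where the 4th pooling has stride 1).

```python
import numpy as np

def upsample2(x):
    """Two-fold nearest-neighbour up-sampling (stand-in for the paper's
    'deconv x 2' / 'BI x 2' operations)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def usp_decode(side_maps):
    """Fuse single-channel side maps from shallow to deep stages by
    2x up-sampling and element-wise addition (not concatenation), so
    every decoder feature map keeps a single channel."""
    fused = side_maps[-1]                  # deepest stage, lowest resolution
    for m in reversed(side_maps[:-1]):
        fused = upsample2(fused) + m       # restore resolution gradually
    return fused
```

Because fusion is addition rather than concatenation, the decoder's channel width never grows, which is what keeps the USP lightweight.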
Finally, our DSP, lightweight USP, and the trimmed VGG16 as the backbone, make up a novel encoder-decoder architecture: the bi-directional pyramid network (BDP-Net). One can see Section 4.3 for details of the model parameters and training speed of the proposed network. The adopted loss function originates from [14].

Experiments
In this section, we first describe the three datasets, then give implementation details, and finally discuss the effectiveness of our network and compare it with some state-of-the-art models.
BSDS500 consists of 200, 100, and 200 images for training, validation, and testing, respectively. The ground truth is obtained by averaging, as multiple annotators label every image. The training and validation sets are used for finetuning, and the test set is used for evaluation. For data augmentation, we utilize the same strategies as [11,13,14], which comprise three data transformations, namely scaling, rotating, and flipping. To expand the training set further, we also add the flipped PASCAL VOC Context dataset.
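The rotating and flipping transformations can be sketched as follows. This is a minimal illustration, not the exact augmentation pipeline of [11,13,14]: scaling is omitted, and only axis-aligned 90-degree rotations are shown.

```python
import numpy as np

def augment(image):
    """Generate rotated and flipped copies of a training image:
    four 90-degree rotations, each paired with its horizontal flip,
    yielding 8 variants per input."""
    out = []
    for k in range(4):                 # 0, 90, 180, 270 degree rotations
        rot = np.rot90(image, k)
        out.append(rot)
        out.append(np.fliplr(rot))     # horizontal flip of each rotation
    return out
```

In practice the same transformation must also be applied to the ground-truth edge map so that image and label stay aligned.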
NYUDv2 contains 381, 414, and 654 images for training, validation, and testing, respectively. Its ground truth provides binary annotations. We utilize the training and validation sets for finetuning, and the test set for evaluation. The data augmentation strategy is the same as in [11,13,14]. Unlike BSDS500, NYUDv2 includes paired depth images in addition to RGB images, as it was initially built for scene understanding. Following previous work [11], we encode the depth information as HHA features, namely horizontal disparity, height above ground, and angle with gravity.
Multicue includes 100 challenging natural scenes. Ten frames are taken from the left and right views of every scene. The last frame of the left view for each scene is annotated with two kinds of labels, namely object boundaries and edges. Here we utilize 80 and 20 images for training and testing, respectively. Regarding data augmentation, we apply the same strategy as [13,14].

Implementation Details
On the three databases, the batch size was set to 10 for all the experiments. In the training phase, we applied full resolution images for BSDS500 and NYUDv2, and randomly cropped 500 × 500 patches for Multicue as its image resolution is high.
All experiments were conducted on a single GeForce GTX 1080Ti. Our architecture was implemented using the publicly available PyTorch (https://github.com/pytorch/pytorch). In the experiments, the VGG16 [38] pre-trained on ImageNet [41] was applied to initialize the backbone.
SGD was used for training. The initial learning rate was 10^−6, except 10^−7 for edge detection on Multicue. The learning rate was divided by 10 after every 10 K iterations. Furthermore, momentum and weight decay were set to 0.9 and 2 × 10^−4, respectively. We trained for 15 K iterations on BSDS500 and NYUDv2, and for 2 K and 1 K iterations on Multicue boundary and edge detection, respectively.
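The step schedule above corresponds to the following simple rule (the default values shown are the ones stated for BSDS500 and NYUDv2; this is a sketch of the schedule, not the training script itself):

```python
def learning_rate(iteration, base_lr=1e-6, step=10000, gamma=0.1):
    """Step learning-rate schedule: the rate is multiplied by `gamma`
    (i.e., divided by 10) after every `step` iterations."""
    return base_lr * (gamma ** (iteration // step))
```

For example, with the default settings the learning rate is 10^−6 for the first 10 K iterations and 10^−7 for the next 5 K, covering the full 15 K-iteration run.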
Standard non-maximum suppression is usually utilized to produce the thinned edge maps before evaluation. The maximum tolerance matching between edge predictions and ground-truth annotations was set to 0.0075 for BSDS500 and Multicue, and 0.011 for NYUDv2. In addition, the random seed was set to a fixed value of 7 to reduce the random error in our experiments.
Concerning evaluation metrics, we adopted the training speed (frames per second, abbreviated as FPS), the average precision (AP) over the full recall range (equivalently, the area under the precision-recall curve), as well as ODS and OIS, which correspond to the F1 score based on the thresholds from the optimal dataset scale and optimal image scale, respectively. Note that the F1 score is defined as

F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN), (1)

where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively, and P and R represent precision and recall. In addition, we utilize the evaluation metric introduced in [24], R50, which represents the recall at 50% precision. We also discuss the computational complexity of the different models, which usually comprises spatial complexity and time complexity. Generally, one utilizes the number of floating-point operations per second (FLOPS) to represent the time complexity, and the parameters of the model to roughly estimate the spatial complexity.
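For concreteness, Equation (1) can be computed from the raw counts as follows (a straightforward sketch of the metric, not the benchmark's evaluation code):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and the F1 score from true positives,
    false positives, and false negatives, per Equation (1):
    P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

ODS applies one threshold (and hence one set of TP/FP/FN counts) across the whole dataset, while OIS picks the best threshold per image before aggregating.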

The Effect of Network Architecture
In this section, we conduct experiments to discuss the effectiveness of the proposed network. For a fair comparison with RCF [13] and BDCN [14], we use the modified VGG16 as our backbone network.

Ablation Study on BSDS500
As our baseline, the modified VGG16 removes all the fully connected layers and the 5th pooling operation. Figure 3 shows the single-scale test results on BSDS500. Compared with the baseline, the proposed down-sampling pyramid network (DSP) and lightweight up-sampling pyramid network (USP) help to achieve higher ODS and OIS, respectively. Meanwhile, BDP-Net, as a combination of both mechanisms, further improves the evaluation metrics.
In addition to the baseline model, we also compare our method with two state-of-the-art models, i.e., RCF [13] and BDCN [14], by reproducing their experiments. From Figure 3 and Table 2, we can see that: (i) Our USP reaches good generalization earlier than RCF. In particular, under the single-scale test, it achieves higher ODS and OIS (0.799 and 0.813) than RCF (0.792 and 0.809), while the training times are 4 h 55 min for 10 K iterations and 7 h for 15 K iterations, respectively.
(ii) Compared with BDCN, the proposed BDP-Net requires nearly half of the training time while retaining comparable metrics. Specifically, under 15 K training iterations, BDP-Net needs 7 h 42 min to obtain ODS, OIS, and AP of (0.803, 0.822, and 0.845), while BDCN requires 12 h 35 min to reach (0.804, 0.820, and 0.724) under the single-scale test.
(iii) Note that our BDP-Net also catches up with human perception (0.803) under the single-scale test while using only the training set from BSDS500. In terms of training speed, BDP-Net is nearly twice as fast as BDCN [14]. Moreover, compared with the USP, our DSP achieves a similar test accuracy and training speed.

In the following, we discuss the computational complexity of the different models. Since the same group of networks (e.g., RCF, BDCN, and our BDP-Net) and the same configurations are used on the three public datasets, we only compare the model parameters through Table 2 in this section. For convenience, we use RCF as the base instead of the modified VGG16. In Table 2, compared with RCF (14.8 M parameters and 81.77 GFLOPS), our down-sampling and lightweight up-sampling pyramid networks each add negligible parameters and floating-point operations. Their combination, BDP-Net, also has only a small increase in complexity (3.9 K parameters and 0.04 GFLOPS). Meanwhile, BDCN requires significantly more parameters and floating-point operations (1.5 M parameters and 32.54 GFLOPS), and thus its training efficiency is greatly reduced.
Finally, we show the test accuracy of more methods in Figure 4. Specifically, apart from HED [11], RCF [13], BDCN [14], and our BDP-Net, it includes: (1) a differential operator, Canny [20]; (2) traditional machine learning methods, including gPb-UCM [2], Pb [21], EGB [23], SE [24], OEF [25], and MShift [42]; and (3) early deep learning methods based on image patches, such as DeepContour [26] and DeepEdge [27]. Compared with the non-deep-learning approaches, the deep learning methods show obvious advantages. Compared with the deep learning systems based on image patches, recent approaches (e.g., HED, RCF, and BDCN) combine multi-scale learning with image-to-image training to further improve the test accuracy. In comparison with the state-of-the-art BDCN, our BDP-Net achieves a similar test accuracy on the BSDS500 dataset. However, when more training data are added (e.g., the PASCAL Context dataset [43]), our approach is slightly inferior, which points to room for further improvement. In addition, we display the prediction results obtained by different methods in Figure 5.

Figure 4. The precision-recall curves of various approaches using a single-scale test on BSDS500. We also display multi-scale versions of some approaches, i.e., BDCN-MS, Ours-MS, and RCF-MS, which are trained by utilizing the additional PASCAL Context dataset [43]. Note that the metrics precision and recall are obtained by using the optimal dataset scale during evaluation.

Performance on NYUDv2 and Multicue
In terms of NYUDv2, we first performed experiments on RGB and HHA images separately, and then averaged their predictions to obtain the results for RGB-HHA. Table 3 shows the evaluation metrics of RCF [13], BDCN [14], and our BDP-Net on NYUDv2. For the HHA feature images, there is little difference in the metrics (ODS, OIS, AP, and R50) obtained by the three networks. For RGB images, our method has a slight advantage in terms of average precision. In addition, compared with RCF, our BDP-Net has approximately the same floating-point operations and training speed, while BDCN adds a large number of floating-point operations and its training speed is therefore significantly reduced. Compared with RCF and BDCN on NYUDv2 RGB images, our method gains an advantage in balancing effectiveness and efficiency. Please see Figure 6 for the precision-recall curves of the different methods. We also display some visual results in Figure 7.

Figure 6. The precision-recall curves of various approaches on NYUDv2. Note that the metrics precision and recall are obtained by using the optimal dataset scale during evaluation.

For Multicue, we carried out two distinct visual tasks, namely object boundary and edge detection. Table 4 reveals some evaluation results of the different architectures. Concerning object boundaries, BDP-Net attains almost as good a performance on the given metrics as BDCN [14] while retaining a training speed similar to RCF [13]. Meanwhile, compared with RCF and BDCN on edge detection, our BDP-Net gains about two points on the metrics (ODS, OIS, AP, and R50). In addition, compared with RCF, our approach has similar floating-point operations and a slightly lower training speed. However, BDCN shows a significant increase in floating-point operations; moreover, it has more parameters, and thus its training efficiency is greatly reduced. All in all, BDCN seems to have no advantage on such a small database.
In addition, please see Figures 8 and 9 for the precision-recall curves and the visual results of the different methods on Multicue, respectively.

Figure 8. The precision-recall curves of various approaches on Multicue. The subscripts "e" and "b" represent the tasks of edge and boundary detection, respectively. Note that the metrics precision and recall are obtained by using the optimal dataset scale during evaluation.

Conclusions and Discussion
To sum up, in this paper, we developed a novel multi-scale learning method, BDP-Net, to deal with contextual information of objects in natural images. According to our extensive experiments, one can find that compared with the state-of-the-art model BDCN [14], our BDP-Net can cut down about half of the training time while retaining a similar test accuracy.
However, from Figure 4 and Table 3, we find that compared with BDCN, our BDP-Net still has some room for improvement in test accuracy when enough training data are available. To our knowledge, common down-sampling methods, such as bilinear interpolation and 1 × 1 convolution with stride 2 or more, usually lose some spatial position information in natural images, so our down-sampling pyramid network has a similar limitation. In addition, to reduce the loss of spatial information, all three methods (our method, RCF, and BDCN) apply down-sampling three rather than four or more times in the backbone network. Furthermore, their training speed still needs to improve. A possible solution could be to combine model compression with a transformer, as in [44,45]. In addition, some modules from neural architecture search (NAS), such as Lite R-ASPP [46], potentially provide novel ideas for edge detection. However, these are beyond the scope of this work. For more details on NAS modules, please refer to the relevant literature [47,48].