Multitask Coupling Network for Occlusion Relation Reasoning

Analysis of the occlusion relationships between different objects in an image is fundamental to computer vision; it includes both the accurate detection of multiple objects' contours in an image and the prediction of each pixel's orientation on the contours of objects with occlusion relationships. However, the severe imbalance between the edge pixels of an object in an image and the background pixels complicates occlusion relationship reasoning. Although progress has been made using convolutional neural network (CNN)-based methods, the limited coupling relationship between the detection of object occlusion contours and the prediction of occlusion orientation has not yet been effectively used in a full network architecture. In addition, the prediction of occlusion orientations and the detection of occlusion edges are based on the accurate extraction of the local details of contours. Therefore, we propose an innovative multitask coupling network model (MTCN). To address the abovementioned issues, we also present different submodules. The results of extensive experiments show that the proposed method surpasses state-of-the-art methods by 2.1% and 2.5% in Boundary-AP and by 3.5% and 2.8% in Orientation-AP on the PIOD and BSDS datasets, respectively, indicating that the proposed method is more advanced than comparable methods.


Introduction
When a 3D scene is projected onto a 2D image, the occlusion boundaries of objects correspond to depth discontinuities between different objects (or between an object and the background). Therefore, analysis of the occlusion relationship between objects from monocular images can reveal the relative depth differences between objects in the scene. Figure 1 shows an example from the BSDS300, in which two deer occlude the lawn; their boundaries are considered occlusion edges, while their shadows are not. This method of inferring occlusion relationships between different objects from a monocular 2D image has been widely applied in many fields, including visual tracking [1,2], mobile robotics [3,4], object detection [5][6][7][8], and segmentation [9][10][11].
A standard strategy employed in the early research on occlusion relationship inference is to combine machine-learning-based techniques with several features, including convexity, triple points, geometric context, keypoint descriptors, and image features such as the Histogram of Oriented Gradients (HOG) and spectral features [12][13][14][15][16][17]. However, these methods mainly use manually designed low-level visual cues to achieve occlusion contour detection. In addition, owing to the difficulty of defining occlusion edges and occlusion clues, as well as limited training data, their detection performance is suboptimal. Recently, with the development of deep learning, convolutional neural networks (CNNs) have been applied to occlusion relationship inference and occlusion edge detection, and have significantly improved detection performance. The DOC [18] uses an innovative approach to express occlusion relationships, where a closed contour is used to represent an object and the depth discontinuity between different objects (or between objects and the background) is represented by the direction value of contour pixels. In this way, the occlusion relationship inference task is decomposed into occlusion edge detection and occlusion orientation prediction. In addition, two subnetworks are constructed for the two subtasks. The DOOBNet [19] uses an encoder-decoder network structure to obtain multiscale, multilevel features, where two subnetworks share the features extracted by a backbone network to achieve occlusion orientation and occlusion edge prediction. Although significant progress has been made in research on occlusion relationship inference, two main problems have yet to be solved.
First, the severe imbalance between edge and non-edge pixels in a sample can lead to the poor prediction accuracy of edges, resulting in coarse occlusion contours that require post-processing with non-maximum suppression (NMS). Second, the coupling relationship between low-level local features and high-level semantic features has not been fully explored and used, which can cause a large amount of noise in the predictions. Low-level local features are the key factors for extracting occlusion boundaries, whereas occlusion orientation prediction relies on both low-level local features and high-level occlusion clues.
Inspired by the DOC [18], DOOBNet [19], OFNet [20], and MT-ORL [21], we propose an innovative multitask coupling network (MTCN) that includes two paths: an occlusion edge extraction path and occlusion orientation prediction path. The proposed model enhances the clarity of occlusion contours by improving edge localization accuracy and suppressing trivial edge noise. In the occlusion edge extraction path, multi-supervision is used, in which each stage of a backbone network has a side output that is used for supervision. During model training, high-level semantic features guide the learning of low-level local features to obtain accurate local spatial information. Furthermore, in the orientation prediction path, we make use of the property that low-level local information represents the basis of high-level occlusion information in deep supervised networks; this allows low-level spatial information to serve as high-level semantic features to obtain high-level occlusion clues through back-propagation, thus improving the accuracy of occlusion orientation prediction. High-level semantic features from deep-side outputs preserve large receptive fields. In the proposed method, the coupling of low-level local spatial information and high-level semantic features is used to obtain multilevel feature maps from various levels, which improves the model's ability to restore local contour-detail features and remove non-occluded pixels. Moreover, the adaptive context coupling module we designed distinguishes the importance of pixels at different positions, guiding the network toward clearer contours and less surrounding noise and, in turn, further enhancing the model's ability to obtain clear contours.
In summary, the main contributions of this study are as follows: (1) An MTCN model that includes two paths, the occlusion edge extraction path and the occlusion orientation prediction path, is proposed; (2) A low-level feature context integration module (LFCI) is proposed that utilizes a self-attention mechanism to distinguish the importance of pixels at different positions and obtain local detail features of objects' contours; (3) To enhance the receptive field of the side output features of the backbone network, we propose a multipath receptive field block (MRFB); (4) To fuse the feature flows extracted at different scales in the network structure, we propose a bilateral complementary fusion module (BCFM); (5) Finally, to address the problem of blurred contour detection caused by a severe imbalance between the edge and non-edge pixels in data samples, we propose an adaptive multitask loss function.

Related Work
Occlusion relationship reasoning from a single image remains extremely challenging. Early work achieved success in simple domains, such as a block world [22] and line drawings [23]. The 2.1-D Sketch [24] uses a mid-level representation to express occlusion relationships. Teo et al. [3] embedded several different local cues (e.g., HOG and extremal edges) and semi-global grouping cues (e.g., Gestalt-like) into structured random forests (SRF) to detect occlusion relationships. Maire et al. [25] designed and embedded boundary relationships' representations into the segmentation depth ordering inference.
Recently, deep learning has achieved great success in occlusion relationship reasoning. The DOC [18] determines the occlusion relationships using a binary edge detector and predicts object boundaries and occlusion orientation variables. It fully learns to use local and nonlocal clues to recover occlusion relationships. The DOOBNet [19] adopts an attention loss function to increase the loss contribution of false-positive and false-negative samples, addressing the imbalance between edge and non-edge pixels. In addition, it uses an encoder-decoder structure with skip connections to obtain multi-scale and multilevel features. The OFNet [20] employs an encoder-decoder network structure with two side output paths, and it can share occlusion relationships and learn to use high-level semantic clues. Moreover, its multi-rate context learner (MCL) module can learn more clues about occlusion reasoning near the boundary. In addition, its bilateral response fusion (BRF) module can accurately determine and distinguish foreground and background areas. The MT-ORL [21] uses the OPNet network to address the limited coupling between occlusion boundary extraction and occlusion orientation prediction. Furthermore, its orthogonal occlusion representation (OOR) can enhance occlusion relationship expression.
However, the aforementioned methods either use two independent architectures for performing the boundary extraction and orientation prediction tasks or excessively share semantic information in deep stages; thus, they do not fully use the coupling between local low-level information and high-level occlusion cues. In contrast, MTCN shares features only in the shallow stages of the decoder and adopts a multi-scale structure, as shown in Figure 2.

MTCN Model
In this section, the overall structure of the proposed network model and details of each internal module are presented. To address the imbalance between edge and non-edge pixels, we improve the loss function by combining the cross entropy, dice loss, and feature unmixing loss functions to obtain an improved adaptive loss function.

Network Architecture
In the process of occlusion relationship reasoning, two important tasks are extracting complete and clear object contours and accurately predicting the orientation of occlusion edge pixels. These dense prediction tasks aim to recover pixel-level spatial details and understand high-level occlusion clues. However, the two tasks differ in how they use specific contextual features: contour extraction focuses on local detailed information, whereas occlusion orientation prediction attends to both local detailed information and the spatial positional relationships between objects. Therefore, to account for both the connection and the difference between the two tasks, they share low-level local detail features but keep separate high-level semantic occlusion clues, which reduces the mutual interference caused by excessive coupling.
As shown in Figure 2, MTCN consists of an encoder and a decoder. The encoder uses a ResNet50 [26] backbone network, and the model is composed of two mutually coupled paths: an edge detection path and an orientation prediction path. As in previous studies, the ResNet50 [26,27] is pretrained on the ImageNet dataset. The decoder consists of the edge detector and orientation predictor, which mainly include the low-level feature context integration module (LFCI), the BCFM, and the multipath receptive field block (MRFB). The proposed model fully exploits the multi-scale features generated at different stages of the network on the two paths while avoiding excessive coupling between low-level spatial positional information and high-level occlusion clues in the two subtasks.

Edge Extraction Path
In the edge detection process, extracting a clear, accurate edge is the basis for an occlusion relationship inference. The occlusion edge represents an object's position and defines the boundary position between bilateral regions. The occlusion edge requires the preserved resolution of the original image and a larger receptive field to achieve accurate positioning and perceive the mutual constraint relationship of the boundary pixels.
The structure of the edge extraction path is shown in Figure 3, in which the input image is propagated from the shallow stages to the deep stages along the ResNet50 backbone network. Three residual modules, denoted by rest1, rest2, and rest3, output low-level feature clues, which are used to encode rich spatial information. The side outputs of the shallow stages have higher resolution and less channel information, qualities that are more conducive to learning low-level information for a specific task. The side outputs of the deep stages (i.e., rest4 and rest5) represent high-level semantic features. These deep stages contain fewer parameters and spatial details but provide rich abstract and global information for use in downstream tasks. The side output feature maps of the different resolutions generated by the encoder are input into the corresponding refinement blocks. The multipath receptive field block and the BCFM are connected in a cascaded manner. The side output of each intermediate BCFM is passed through a 1 × 1 convolution to obtain the multi-scale predicted edges. Finally, all of the edges predicted by the intermediate layers are fused to construct the boundary map. The proposed structure can integrate feature maps of different scales generated in the encoding process, and semantic edges can be obtained through fusion. In the model training process, the multi-scale feature maps of the side outputs can be fully used for deep supervision learning. The generated boundary map contains both low- and high-level features, ensuring the consistency and accuracy of occlusion edges. Specifically, the edge detection path provides a complete and continuous contour, which constitutes the object region.
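The side-output scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature shapes and random weights are hypothetical, nearest-neighbour up-sampling stands in for the learned up-sampling, and averaging is assumed as the fusion step because the text does not specify how the side predictions are fused.

```python
import numpy as np

def conv1x1(x, w):
    """1 x 1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample_nearest(x, factor):
    """Nearest-neighbour up-sampling of a (C, H, W) map (assumed stand-in)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def side_edge_maps(features, weights, out_size):
    """One edge probability map per side output, all at out_size x out_size."""
    maps = []
    for feat, w in zip(features, weights):
        logit = conv1x1(feat, w)                # (1, h, w) edge logits
        factor = out_size // feat.shape[1]      # bring back to a common size
        maps.append(sigmoid(upsample_nearest(logit, factor))[0])
    return maps

def fuse(maps):
    """Assumed fusion: plain average of the side predictions."""
    return np.mean(maps, axis=0)

rng = np.random.default_rng(0)
out_size = 64
# Hypothetical side features at 1/2, 1/4, 1/8 and 1/16 of the output size.
features = [rng.standard_normal((c, out_size // s, out_size // s))
            for c, s in [(16, 2), (32, 4), (64, 8), (128, 16)]]
weights = [0.1 * rng.standard_normal((1, f.shape[0])) for f in features]

side = side_edge_maps(features, weights, out_size)
fused = fuse(side)
```

During training, each map in `side` as well as `fused` would receive its own supervision signal, which is what deep supervision refers to here.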

Orientation Prediction Path
Unlike the edge extraction path, the orientation path predicts the orientation of contour pixels to describe the occlusion relationship between the object regions. In this study, we posit that accurately obtaining contour pixels is the basis for direction prediction. Therefore, we present a fully connected structure designed from the low level to the high level to achieve position- and stage-adaptive context aggregation. As shown in Figure 3, the output features of refinement blocks 1, 2, 5, and 6 are fed to the low-level feature context integration module, which enhances features at each stage with a context adaptively selected from the lower stages through cross-stage self-attention. Refinement blocks 1 and 2 learn from dual-supervision signals from both the edge path and the orientation path. Therefore, they can learn low-level semantic features that contain more spatial and positional information than features learned through single supervision from the edge path. Refinement blocks 5 and 6 output features that focus more on restoring relationships between different object regions. Therefore, in the proposed model, the decoder has two sub-paths that share the shallow layers and separate the deep layers. This structure avoids excessive sharing between the two sub-paths, which can cause mutual interference, and thus helps improve the accuracy of orientation prediction.

Submodule Structure

(1) Low-level Feature Context Integration Module
Inspired by [28], we propose a low-level feature context integration module that enhances features from all low-level stages by adaptively selecting the context using a cross-stage self-attention mechanism. Then, the enhanced features are up-sampled from all stages to the lowest-stage resolution and fused. The fused features contain rich, adaptively selected context information for occlusion contour restoration while also preserving high-resolution detail information.
The feature aggregation process for stage $i$ is presented in Figure 3, where the feature map of stage $i$ is denoted as $R_i$. The features $R_j$ of the lower stages, where $N = 4$ represents the total number of stages, are used as a source of spatial position information. The adaptive context aggregation is the concatenation of $R_i$ with the context information gathered from all lower-level features, which can be mathematically expressed as

$$S_i = \mathrm{Concat}\big(R_i,\ \{\mathrm{Trans}(R_i, R_j)\}_{j}\big), \quad (1)$$

where $j$ ranges over the lower stages, $S_i$ has the same resolution as $R_i$, and the Trans operation, which fuses the cross-stage features $R_i$ and $R_j$, is expressed in Equation (2).
The Trans operation is presented in the dashed box in Figure 4. First, a 1 × 1 convolution and up-sampling are performed on $R_j$ to transform it into the key $K_j$, which has the same resolution as $R_i$. Next, $R_i$ is transformed into the query $Q_i$ using a 1 × 1 convolution. Then, the dot product is used to calculate the similarity between the elements of $Q_i$ and $K_j$, and the result is processed by the ReLU function to obtain $A_{j,i}$. The scale of $A_{j,i}$ is jointly determined by the sizes of $R_i$ and $R_j$. As the network depth increases, $R_j$ contains larger receptive fields and more occlusion clues. Therefore, as the value of $j$ increases, the size of $A_{j,i}$ decreases. Finally, the dot product of $A_{j,i}$ and $V_j$ provides the result of the Trans operation:

$$\mathrm{Trans}(R_i, R_j) = A_{j,i} \cdot V_j, \qquad A_{j,i} = \mathrm{ReLU}(Q_i \cdot K_j), \quad (2)$$

where $V_j$ is obtained by performing two consecutive 1 × 1 convolutions and a sub-pixel operation on $R_j$.
For all enhanced features $S_i$, up-sampling is performed using sub-pixel convolution, and the features are summed to obtain the final output feature $S_{final}$, which can be mathematically expressed as follows:

$$S_{final} = \sum_{i=1}^{N} \mathrm{Up}(S_i), \quad (3)$$

where $\mathrm{Up}$ represents the sub-pixel convolution for up-sampling and $S_{final}$ has the same resolution as the first-stage data (i.e., half the size of the original image).
The proposed low-level feature context integration module allows class-agnostic edges to attach appropriate occlusion clues directly, enabling the occlusion contours to stand out from the general edges. In addition, the Trans operation can integrate all spatial detail information from lower stages, thus providing sufficient local information for the orientation prediction of edge pixels.
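To make the cross-stage aggregation more concrete, the following NumPy sketch shows one possible reading of the Trans operation and the concatenation that produces the enhanced feature of a stage. Several points are assumptions rather than details confirmed by the text: the similarity is taken as a per-position dot product over channels (so the attention map has one channel), nearest-neighbour up-sampling replaces the sub-pixel operation, and all shapes, weights, and helper names (`trans`, `aggregate`, `make_params`) are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1 x 1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample_nearest(x, factor):
    """Nearest-neighbour up-sampling (assumed stand-in for sub-pixel up-sampling)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def trans(r_i, r_j, wq, wk, wv1, wv2):
    """Context selected from r_j for stage i, at the resolution of r_i."""
    factor = r_i.shape[1] // r_j.shape[1]
    q = conv1x1(r_i, wq)                                # query from R_i
    k = upsample_nearest(conv1x1(r_j, wk), factor)      # key from R_j
    # Assumed similarity: channel-wise dot product at every position, then ReLU.
    a = np.maximum((q * k).sum(axis=0, keepdims=True), 0.0)
    # Value: two consecutive 1 x 1 convolutions, then up-sampling.
    v = upsample_nearest(conv1x1(conv1x1(r_j, wv1), wv2), factor)
    return a * v

def aggregate(r_i, contexts, params):
    """Concatenate R_i with the context selected from every other stage."""
    parts = [r_i] + [trans(r_i, r_j, *p) for r_j, p in zip(contexts, params)]
    return np.concatenate(parts, axis=0)

def make_params(c_i, c_j, d, c_v, rng):
    """Random projection weights (hypothetical sizes)."""
    return (rng.standard_normal((d, c_i)), rng.standard_normal((d, c_j)),
            rng.standard_normal((c_v, c_j)), rng.standard_normal((c_v, c_v)))

rng = np.random.default_rng(0)
r_i = rng.standard_normal((8, 16, 16))
contexts = [rng.standard_normal((16, 8, 8)), rng.standard_normal((32, 4, 4))]
params = [make_params(8, c.shape[0], 4, 8, rng) for c in contexts]

s_i = aggregate(r_i, contexts, params)   # shape (8 + 8 + 8, 16, 16)
```

Because the attention map is non-negative and position-dependent, positions whose query and key disagree contribute no context at all, which is one way to read "adaptively selected" in the description above.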
(2) Bilateral Complementary Fusion Module
Inspired by [28,29] and the following observations, we present the BCFM to fuse the feature flows extracted at different scales in the network structure. This module not only restores the missing local edge details but also removes the cluttered, non-occluded pixels. The main idea behind it is to use the complementarity of the feature information of two stages to conduct fully bilateral edge response fusion and enhance the unified orientation fusion map of occlusion. The ResNet50 backbone network [26,27] is divided into five stages, each of which is input to a multipath receptive field module to obtain side output features, as shown in Figure 2. The side output features of the receptive field module are cascaded and fed to the BCFM's input. Although the two stages in each group are similar in terms of characteristics, they have complementary clues owing to the step-by-step encoding in the CNN structure. Given a set of features containing two stages, the BCFM compensates for the details ignored in the higher-stage features due to down-sampling, dilated convolution, and the restricted receptive field of the lower stages.
In particular, residual connections are used inside the BCFM, as shown in Figure 4. Assume that a pair of side outputs of the receptive field module is given and denoted by $(E_i, E_{i+1})$. Input $E_{i+1}$ is first passed through the sub-pixel convolution layer to convert the low-resolution feature map into a high-resolution one; $E_i$ and the up-sampled $E_{i+1}$ are then fed to the ReLU activation function to obtain $R(E_i)$ and $R(Up(E_{i+1}))$. Furthermore, to compensate for the potential response lost in $R(E_i)$ and $R(Up(E_{i+1}))$, we calculate $1 - R(E_i)$ and $1 - R(Up(E_{i+1}))$. These values are separately multiplied by $E_i$ and $E_{i+1}$ and passed through the 1 × 1 convolution layer. The results are then fused with the feature map from the shortcut path to obtain a pair of enhanced features $(E_i', E_{i+1}')$. Finally, $E_i'$ and $E_{i+1}'$ are concatenated into $E_f$, the enhanced output feature of the BCFM. In the corresponding formulas, $R$ represents the ReLU function, $Conv$ represents the 1 × 1 convolution layer, and $Up$ represents the sub-pixel up-sampling operation. However, in some cases, there may be no response even when considering the complementary information from the two stages. In addition, such a well-designed convolution block eliminates the gridding artifacts caused by the dilated convolution in high-level layers [30].
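The data flow of the BCFM can be sketched in NumPy as follows. This is an illustrative reading, not the authors' code: it assumes that each complement is multiplied with the feature of its own stage and that the fusion with the shortcut path is an element-wise addition, neither of which is stated explicitly in the text. The sub-pixel up-sampling is implemented as a standard pixel-shuffle rearrangement; the weights and shapes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """1 x 1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def pixel_shuffle(x, r):
    """Sub-pixel up-sampling: (C * r * r, H, W) -> (C, H * r, W * r)."""
    c, h, w = x.shape
    c_out = c // (r * r)
    x = x.reshape(c_out, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)          # (c_out, h, r, w, r)
    return x.reshape(c_out, h * r, w * r)

def bcfm(e_i, e_next, w_i, w_next, r=2):
    """Fuse a pair of adjacent-stage features (assumed pairing and fusion)."""
    up = pixel_shuffle(e_next, r)                        # Up(E_{i+1})
    comp_i = 1.0 - relu(e_i)                             # 1 - R(E_i)
    comp_next = 1.0 - relu(up)                           # 1 - R(Up(E_{i+1}))
    e_i_enh = conv1x1(comp_i * e_i, w_i) + e_i           # shortcut: addition (assumed)
    e_next_enh = conv1x1(comp_next * up, w_next) + up
    return np.concatenate([e_i_enh, e_next_enh], axis=0)  # E_f

rng = np.random.default_rng(0)
e_i = rng.standard_normal((4, 8, 8))        # higher-resolution stage
e_next = rng.standard_normal((16, 4, 4))    # next stage, 4 * r * r channels
w_i = 0.1 * rng.standard_normal((4, 4))
w_next = 0.1 * rng.standard_normal((4, 4))

e_f = bcfm(e_i, e_next, w_i, w_next)        # shape (8, 8, 8)
```

With the residual form assumed here, zero convolution weights reduce the module to a plain concatenation of the two stages, so the complementary branch only has to learn a correction.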

(3) Multipath Receptive Field Block
Reasoning about the occlusion relationships between different objects requires accurate order prediction of the foreground and background regions. Therefore, a receptive field enhancement module is embedded into the network structure to perceive information on object edges. Inspired by the work presented in [31], we propose the MRFB. The MRFB structure includes several parts: a multibranch convolution layer using different convolution kernels, a dilated convolution layer at the end of each branch (used to form constraints between the contour pixels), and a shortcut branch, as shown in Figure 5.
Specifically, in each branch, the bottleneck structure, which consists of a 1 × 1 convolution layer, is used to integrate various channel clues and promote bilateral feature aggregation across channels at the same position. The central branch consists of two 3 × 3 convolution modules, and the other two branches consist of two 1 × 3 and two 3 × 1 strip convolution modules, respectively. The two strip convolution paths simultaneously increase the receptive field and the feature weight of the central pixel in orthogonal directions, thus accumulating more orthogonal signals with more information for the subsequent prediction of the occlusion direction. At the end of each branch, dilated convolution with a different dilation rate is used, which can capture richer contextual information while maintaining the same number of parameters. In addition, the shortcut path design from ResNet [26] is used. Finally, the feature maps output by the branches are concatenated and then passed through a 1 × 1 convolutional layer. The resulting feature map is then fused with the feature maps from the shortcut path.
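The effect of the strip and dilated convolutions on the receptive field can be checked with a little arithmetic. The sketch below uses the standard relation that a kernel of size k with dilation d covers k + (k - 1)(d - 1) positions while keeping the same number of weights. The dilation rates (5 for the central branch, 3 for the strip branches) and the 3 × 3 size of the final dilated kernel are hypothetical, since the text does not give them.

```python
def effective_kernel(k, dilation=1):
    """Extent covered by a size-k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def receptive_field(layers):
    """Receptive field (height, width) of stacked stride-1 convolutions.

    layers is a list of ((kernel_h, kernel_w), dilation) entries.
    """
    rf_h = rf_w = 1
    for (kh, kw), d in layers:
        rf_h += effective_kernel(kh, d) - 1
        rf_w += effective_kernel(kw, d) - 1
    return rf_h, rf_w

def param_count(kh, kw, c_in, c_out):
    """Weights of a convolution layer; the dilation rate does not appear."""
    return kh * kw * c_in * c_out

# Bottleneck 1 x 1, two branch convolutions, then a dilated 3 x 3 (assumed rates).
branches = {
    "central": [((1, 1), 1), ((3, 3), 1), ((3, 3), 1), ((3, 3), 5)],
    "horizontal strip": [((1, 1), 1), ((1, 3), 1), ((1, 3), 1), ((3, 3), 3)],
    "vertical strip": [((1, 1), 1), ((3, 1), 1), ((3, 1), 1), ((3, 3), 3)],
}
fields = {name: receptive_field(layers) for name, layers in branches.items()}
```

Under these assumed rates, the two strip branches end up with elongated receptive fields that are transposes of each other, which matches the idea of collecting signals along orthogonal directions.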

(4) Multitask Loss Function
For the proposed CNN-based multitask separation network model, it is necessary to achieve accurate extraction of occlusion contours and prediction of tangential directions of contour pixels simultaneously. Therefore, the loss function should be derived considering the following three aspects: the fusion of multitask loss functions, the accuracy of detection results, and the severe imbalance between the contour and non-contour pixels. Inspired by previous work [32,33], we propose a multitask loss function. The total loss $L$ is calculated as follows:

$$L = \omega_e L_e + \omega_o L_o,$$

where $L_e$ and $L_o$ represent the loss functions of the contour extraction path and the orientation prediction path, respectively, and $\omega_e$ and $\omega_o$ represent the weights of $L_e$ and $L_o$, respectively.
In the contour extraction path, four side contour feature maps and a fused contour feature map, $\hat{P}_{s-1}, \hat{P}_{s-2}, \cdots, \hat{P}_{s-\kappa}, \hat{P}_f$, are obtained. The contour extraction loss function $L_e(\hat{P}, Y)$ is defined as follows:

$$L_e(\hat{P}, Y) = \omega_f L(\hat{P}_f, Y) + \sum_{i=1}^{\kappa} \omega_{s-i} L(\hat{P}_{s-i}, Y),$$

where $\kappa = 4$; $\omega_f$ and $\omega_{s-i}$ are the weights of the fused contour loss function and the side contour loss functions, respectively; $\hat{P} = (P_i,\ i = 1, 2, \cdots, |I|)$, with $P_i \in (0, 1)$, represents the predicted contour feature map; and $Y$ is the ground-truth label value.
To obtain accurate occlusion contours, we use an adaptive loss function L P , Y of the BMRN proposed in [29], which is defined as L(P, is the dice coefficient loss function of the LPCB proposed by Deng et al. [31]; and L C (P, Y) is the standard cross-entropy loss function of the HED proposed in [32]. All loss functions calculate all pixels in the corresponding feature map. Furthermore, L o is the loss function of the orientation path, which calculates only the ground-truth boundary pixels; L o is the orthogonal orientation regression loss (OOR) defined in the MT-ORL [21], which is defined as follows: , whereâ andb represent the predicted values of the network model, and θ is the ground truth of the occlusion orientation. Owing to the reasoning of occlusion relationships between different objects, accurate order prediction of the foreground and background regions is required. Therefore, an embedded receptive field enhancement module we designed is implemented into the network structure to perceive information on object edges. Inspired by the work presented in [31], we propose the MRFB. The MRFB structure includes several parts: a multibranch convolution layer using different convolution kernels, a dilation convolution layer at the end of each branch (used to form constraints between the contour pixels), and a shortcut branch, as shown in Figure 5.
Specifically, in each branch, a bottleneck structure consisting of a 1 × 1 convolution layer is used to integrate information from the different channels and promote feature aggregation across channels at the same position. The central branch consists of two 3 × 3 convolution modules, and the other two branches consist of two 1 × 3 and two 3 × 1 strip convolution modules. The two strip convolution paths simultaneously increase the receptive field and the feature weight of the central pixel in the orthogonal directions, thus accumulating more informative orthogonal signals for the subsequent prediction of the occlusion direction. At the end of each branch, a dilated convolution with a different dilation rate is used, which can capture richer contextual information while maintaining the same number of parameters. In addition, the shortcut path design from ResNet [26] is used. Finally, the feature maps output by the branches are concatenated and then passed through a 1 × 1 convolutional layer. The resulting feature map is then fused with the feature maps from the shortcut path.
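The statement that dilation enlarges the context without adding parameters can be checked with simple arithmetic. The sketch below is illustrative only. It assumes stride-1 layers, a 3 × 3 kernel for the dilated convolution, and example dilation rates chosen here; none of these values are given in the text. It also assumes that one strip branch stacks two 1 × 3 layers and the other two 3 × 1 layers. All function names are hypothetical.

```python
from typing import Iterable, Tuple


def effective_kernel(k: int, dilation: int) -> int:
    """Spatial extent covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def conv_params(kh: int, kw: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Parameter count of a 2D convolution; the dilation rate does not appear."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)


def stacked_receptive_field(layers: Iterable[Tuple[int, int, int]]) -> Tuple[int, int]:
    """Receptive field (height, width) of stride-1 layers given as (kh, kw, dilation)."""
    rf_h = rf_w = 1
    for kh, kw, d in layers:
        rf_h += effective_kernel(kh, d) - 1
        rf_w += effective_kernel(kw, d) - 1
    return rf_h, rf_w


# Hypothetical branch layouts: 1x1 bottleneck, two branch convolutions,
# then a dilated 3x3 convolution (the dilation rates 5 and 3 are assumptions).
central_branch = [(1, 1, 1), (3, 3, 1), (3, 3, 1), (3, 3, 5)]
horizontal_strip_branch = [(1, 1, 1), (1, 3, 1), (1, 3, 1), (3, 3, 3)]
```

Calling `stacked_receptive_field` on the two layouts shows how the strip branch extends the receptive field mainly along one axis, while `conv_params` returns the same count for a 3 × 3 layer whatever dilation rate is chosen.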

Results and Discussion
The performance of the proposed network model was verified by a large number of experiments. In addition, the design choices of the network model were validated through ablation experiments.

Datasets and Implementation Details
The proposed model was evaluated on two public occlusion relationship datasets, the PIOD [23] and BSDS [22] datasets. The PIOD dataset includes 9175 training images and 925 test images; in this dataset, each original image corresponds to two labels: the object occlusion contour label and the occlusion orientation label. The BSDS ownership dataset contains 100 training images and 100 test images. Similar to the PIOD dataset, each original image in the BSDS dataset corresponds to two labels: the object occlusion contour label and the occlusion orientation label. During the training process, all images were randomly divided into two datasets containing images cropped to a resolution of 320 × 320 px, whereas the original image size was kept unchanged during the testing process.
(1) Implementation Details
The proposed network model was implemented in PyTorch, and AdamW was used as the optimizer for training. The backbone encoder was the ResNet50 [26,27] architecture pre-trained on the ImageNet dataset. The other hyper-parameters were set as follows: the mini-batch size was 10, the global learning rate was 3 × 10^−4, and the momentum was 0.9. The models for the PIOD and BSDS datasets were trained for 40,000 and 20,000 iterations, respectively, with a 10-fold decrease in the learning rate every 1000 iterations. ω_{s−[1−4]}, ω_e, ω_o, λ, and θ were set to 1.1, 2.1, 1.7, 1.0, and 1.0, respectively.

(2) Evaluation Metrics
During the testing phase, the precision and recall of the proposed model were calculated. For edge extraction and occlusion orientation prediction, the boundary precision-recall (BPR) and orientation precision-recall (OPR) were used to represent the precision and recall of the edges and occlusion orientations, respectively. Three evaluation criteria were calculated from the precision-recall results: the F-measure at a fixed contour threshold for the whole dataset (ODS), the F-measure at the best threshold of each image (OIS), and the average precision (AP). In the comparison experiments, the B- and O-metrics denote the metrics related to the BPR and OPR, respectively. Note that only the correctly detected contour pixels were considered for the OPR.
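The three criteria can be illustrated with a simplified sketch. It assumes that true-positive, false-positive, and false-negative counts are already available for every image at every threshold. The matching of predicted and ground-truth pixels that produces these counts is not shown, and AP is approximated here by trapezoidal integration over the observed recall range rather than by the exact procedure of the benchmark code. All names are hypothetical.

```python
from typing import Iterable, List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


def sum_counts(items: Iterable[Counts]) -> Counts:
    """Add up counts element-wise."""
    tp = fp = fn = 0
    for a, b, c in items:
        tp, fp, fn = tp + a, fp + b, fn + c
    return tp, fp, fn


def ods(counts: List[List[Counts]]) -> float:
    """Best F-measure when one threshold is shared by all images.

    counts[i][t] holds the counts of image i at threshold index t.
    """
    n_thresholds = len(counts[0])
    return max(
        precision_recall_f(*sum_counts([img[t] for img in counts]))[2]
        for t in range(n_thresholds)
    )


def ois(counts: List[List[Counts]]) -> float:
    """F-measure of the pooled counts when each image uses its own best threshold."""
    chosen = [max(img, key=lambda c: precision_recall_f(*c)[2]) for img in counts]
    return precision_recall_f(*sum_counts(chosen))[2]


def average_precision(counts: List[List[Counts]]) -> float:
    """Trapezoidal area under the dataset-level precision-recall points."""
    n_thresholds = len(counts[0])
    points = []
    for t in range(n_thresholds):
        p, r, _ = precision_recall_f(*sum_counts([img[t] for img in counts]))
        points.append((r, p))
    points.sort()
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

The same functions apply to both the boundary and the orientation metrics; only the definition of a true positive differs between the two.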

Ablation Experiments
We analyzed the effect of each sub-module of the proposed algorithm on the performance of the overall network model through ablation experiments, using the BSDS500 training and validation sets to train the network model and the test set to evaluate model performance. Table 1 shows the impact of the different modules of the proposed network model on the algorithm performance. LFCI had the greatest impact on the overall model performance, while BCFM had the smallest impact.

(1) Performance Comparison on BSDS Dataset
Training the model on the challenging BSDS dataset, with limited training samples, is difficult. Still, on this dataset, the proposed method achieved better performance than the other methods in terms of the BPR and OPR, as shown in Figure 6a and Figure 6b, respectively. Furthermore, as shown in Table 2, the proposed method achieved the best performance in terms of ODS, OIS, and AP among all methods, improving the B-AP and O-AP values by 2.1% and 5.2%, respectively, compared with the state-of-the-art methods. Moreover, its running efficiency was also the highest, reaching 33.2 FPS. The results validate the effectiveness of the proposed method in both occlusion orientation prediction and contour extraction.

The precision-recall curves of different methods in the contour extraction and occlusion orientation prediction tasks on the PIOD dataset are shown in Figure 7a and Figure 7b, respectively. The results show that the proposed method achieved significant improvements in both precision and recall compared with the other methods. The experimental results of the ODS, OIS, and AP metrics of different methods are presented in Table 3, which shows that the proposed method outperformed the MT-ORL method. Furthermore, the proposed method achieved the shortest running time and the highest efficiency (34.6 FPS) on the PIOD test set.

Finally, the contour and occlusion relationship maps were constructed. The results of the experiments on multiple images from the BSDS and PIOD datasets show that the contours extracted by the proposed method were more complete and clearer than those of the DOOBNet [19] and MT-ORL [21] methods, as shown in Figures 8 and 9.
In addition, as shown in Figure 8, in scenes containing large solid-colored areas, the proposed method had a sufficiently large receptive field, whereas a small receptive field makes it difficult to perceive large solid-colored objects and correctly predict their occlusion relationship. Furthermore, in scenes where many foreground objects appeared, the edge occlusion relationships were complex and difficult to distinguish, as shown in Figure 9. Even so, the proposed method could accurately detect the occluded edges, showing stronger generalization potential compared with the other methods.
In summary, the proposed method could significantly outperform other methods in edge extraction and occlusion direction prediction, thus validating its effectiveness. In Figures 8 and 9, the pixels on the right side of the arrows represent the image background, and those on the left side represent the image foreground. The "red" arrow pixels indicate a correctly predicted occlusion direction, whereas the "cyan" pixels mark boundaries that are detected correctly but whose occlusion direction is predicted incorrectly. Moreover, the "green" pixels indicate false-negative boundaries, while the "orange" pixels denote false-positive boundaries.

Conclusions
This paper proposed an innovative network model, called MTCN, that addresses two subtasks, extracting occluded object contours and predicting occlusion directions, by constructing a corresponding sub-path for each. The two sub-paths share the shallow stages of the network structure and are separated in the deep stages. Moreover, the BCFMs, LFCI, and a multipath receptive field module are incorporated into the proposed network structure. During the training of the proposed model, the two sub-paths can be trained either separately or jointly. The proposed method was verified by experiments, and the experimental results show the superiority of the proposed method compared to state-of-the-art methods on the BSDS and PIOD datasets.