Part-Based Obstacle Detection Using a Multiple Output Neural Network

Detecting the objects surrounding a moving vehicle is essential for autonomous driving and for any kind of advanced driving assistance system; such a system can also be used for analyzing the surrounding traffic as the vehicle moves. The most popular techniques for object detection are based on image processing; in recent years, they have become increasingly focused on artificial intelligence. Systems using monocular vision are increasingly popular for driving assistance, as they do not require complex calibration and setup. The lack of three-dimensional data is compensated for by the efficient and accurate classification of the input image pixels. The detected objects are usually identified as cuboids in the 3D space, or as rectangles in the image space. Recently, instance segmentation techniques have been developed that are able to identify the freeform set of pixels that form an individual object, using complex convolutional neural networks (CNNs). This paper presents an alternative to these instance segmentation networks, combining much simpler semantic segmentation networks with light, geometrical post-processing techniques, to achieve instance segmentation results. The semantic segmentation network produces four semantic labels that identify the quarters of the individual objects: top left, top right, bottom left, and bottom right. These pixels are grouped into connected regions, based on their proximity and their position with respect to the whole object. Each quarter is used to generate a complete object hypothesis, which is then scored according to object pixel fitness. The individual homogeneous regions extracted from the labeled pixels are then assigned to the best-fitted rectangles, leading to complete and freeform identification of the pixels of individual objects. The accuracy is similar to instance segmentation-based methods but with reduced complexity in terms of trainable parameters, which leads to a reduced demand for computational resources.


Introduction
Artificial intelligence-based image processing has become popular in recent years and has improved the development of solutions in multiple fields, most notably in those that rely on computer-based vision. Deep learning systems feature in various real-life applications, making use of large publicly available datasets and of the increased processing power now available. Most notably, deep learning-based computer vision has been used to predict relevant information, with potential uses in various fields, such as medical imaging, piloting autonomous vehicles, robots or platforms, surveillance, and so on. For the prediction task, much of the work has been focused on developing convolutional neural networks (CNNs) that usually outperform the traditional image-processing algorithms.
Of all the possible uses for CNN-based computer vision, the field of automotive applications, which not only includes autonomous vehicles but also other advanced driving assistance systems, has benefited from special attention in recent years. This field is significant from the computer vision point of view because the environment to be perceived is dynamic, unpredictable, and complex, and the image perception must be performed in real time using limited computer architectures.

The main contributions of this paper are as follows:
1. A multiple-head network architecture that is able to produce the geometric parts of the objects, which can then be grouped into individual instances, free and occupied pixel classification for results refinement, and voting maps for vanishing-point computation;
2. A lightweight object-clustering algorithm, based on the results of the multiple-head semantic segmentation network;
3. An automated training solution for the multiple-head network, which uses publicly available databases without the need for manual annotation of the object parts.

Multi-Task Deep Learning
Deep learning-based solutions have been widely used for different tasks in recent years, most notably in approaches based on convolutional neural networks that are used for the classification of images [1], semantic image segmentation [2][3][4], object detection [5,6], or even tracking [7,8]. Different methods using deep learning for autonomous vehicles and driving have been presented elsewhere, in a survey [9]. In this paper, we focus on camera-based image processing; work using active sensors and sensor fusion has been also presented elsewhere and represents an active field of study [10,11].
Recent developments have been proposed in order to provide artificial neural networks that are able to perform multiple prediction tasks. The advantage of these solutions is that they improve computational efficiency and have a reduced memory footprint; however, the main drawback is that they are harder to train as they rely heavily on accurate and robust training datasets, which are laborious to produce. The multiple predictions of a single network are harder to inspect and debug. Still, multi-output neural networks have been developed; they proved to be scalable and easily expandable, meaning that an existing multi-task network can be expanded to predict new information as needed, including optical flow or tracking, scene depth, and so on. This is accomplished by reusing the network's shared extracted features. Training multi-output networks can sometimes be challenging, due to the fact that the various outputs require different loss functions that need to be weighted. This process can be time-consuming and requires fine-tuning. Multi-task artificial networks improve learning by sharing the specific features of each task.
MultiNet [12,13] is a network that performs multiple predictions, whereas the work presented in [14] refers to a network that uses the YUV color space for the input images and can also produce multiple predictions. Elsewhere, a network that predicts semantic, instance, and depth information from a traffic scene has been presented in [15].

Object Detection
In the context of autonomous vehicles and driving, detecting objects can be performed with images acquired using a single camera [16] or a stereo-vision setup [17]. The detection task can also be achieved by using multiple input sources and sensorial data, such as lidar [10,18] or radar [19]. An overview of the multiple sensors used in autonomous vehicles is presented in [20]. The accuracy and robustness of a self-driving vehicle can be increased by using multiple sensors and fusing the data, but the main drawback is that these sensors require specific calibration and synchronization. The calibration and fusion of data add complexity to such a system, as well as additional costs.
One of the most convenient and affordable sources of information is the single camera, which is able to produce grayscale or color image streams. Single-camera systems are easy to set up and calibrate, and the lack of 3D information can be compensated for by increasingly complex and accurate algorithms.
Traditional object detectors have used image-processing-based algorithms, such as Viola-Jones [21] or a histogram of oriented gradients [22], whereas the modern object detection approaches are based on artificial learning, more specifically, convolutional neural networks. Some approaches fuse traditional and deep-learning methods for object detection [23]. Detecting objects is an extension of object classification, with the main aim of estimating the localization of all instances of certain objects from an input image. Recent detection methods that are based on artificial neural networks feature two main modules: a pre-trained backbone network and a head that outputs the object's class, along with its bounding box coordinates. The backbone networks are used to extract the relevant features from the input images. Detection neural networks are trained using large amounts of labeled data; the detection process is implemented either in a single-stage (one-stage) or a two-stage approach.
Single-stage, also called one-stage, detectors localize objects and classify them in a single shot, by performing a regression of the bounding boxes, using predefined boxes and keypoints with multiple aspect ratios and scales.
Yolo [5], with its variations and improvements, and Single-Shot Detector (SSD) [6] are examples of one-stage detectors using CNNs. The authors of Yolo treated the detection problem as a regression problem, meaning that the CNN predicts the image pixels as the object, along with its bounding box coordinates and its class. Drawbacks regarding the localization of small objects from the original paper were fixed in the newer versions of Yolo [24], as well as by using different backbone networks, such as DarkNet [25]. Single-Shot Multibox Detector (SSD) used VGG-Net [26] as the backbone initially and matched the accuracy of some two-stage detectors. Limitations regarding small objects were fixed using ResNet [1] as the backbone. One-stage detectors usually achieve a very high frame rate, performing the detection in real time.
Two-stage detectors feature a separate module for region proposals, meaning that they are more complex in design and require more time for predictions. In the first step, the two-stage detectors generate regions of interest from the input image and use them in the second step to regress the results to object classes and bounding boxes. Usually, they offer better accuracy but lack real-time performance. R-CNN [27], and its further improvements (Faster R-CNN [28]), represent two-stage detectors that use CNNs to classify and locate objects within images. The input image is fed into a region proposal module that outputs object predictions (candidates), which are then used in a second CNN module that extracts the object class and bounding box. These networks are usually more complex than one-stage detectors and need greater computational time for predictions. Mask R-CNN [29] extends the previous two-stage detectors, also performing instance segmentation by adding a mask head parallel to the bounding box and classification head. This network approach features good accuracy but lacks real-time performance. However, Mask R-CNN has remained one of the fastest networks for instance segmentation with challenging datasets. Other networks have been proposed that are faster and make use of techniques from object detectors, applying them to instance segmentation: Mask-YOLO [30] and YOLACT [31]. The YOLACT network divides the instance segmentation task into two parallel tasks to achieve real-time performance. A comprehensive survey regarding 2D image segmentation techniques using deep learning is presented in [32].

Solution Description
The core of the solution is the multiple-head artificial neural network, based on an encoder-decoder structure that uses a color image as input, whereas the output consists of multiple prediction modules: an obstacle detection module, a semantic segmentation module, and a vanishing-point detection module. The input part, based on an encoder, will extract the proper and significant features from the input image, whereas the decoder, which is based on semantic segmentation solutions, will provide multiple predictions (outputs). We have trained each module independently and we have used the same loss function. The CNN architecture is presented in Figure 1.
The detection module will identify individual objects or obstacles from the road scene, including vehicles, trucks, buses, cyclists, pedestrians, etc. This is based on semantic segmentation approaches; more specifically, on a modified U-Net CNN decoder module. The method of extracting the individual object instances is unique, meaning that we made use of the semantic segmentation CNN to label the image pixels with the corresponding object quarters (parts), instead of using it to label free space or specific obstacles. The labeled object parts were then grouped further into individual objects, using post-processing that took into account the proximity and position of the quarters as part of a full object. This approach simplifies the overall artificial network architecture by making use of the direct connections with the decoder's layers. Another advantage of a multiple output network is that the encoder part is shared between the modules and the model can easily be extended, to predict other information. The detection module output features four layers (channels) that encode the binary status of the image pixels as being part of an object quarter (top left, top right, bottom left, bottom right). Using these parts, the algorithm described in Section 3.4 is able to extract the object instances as bounding boxes and as the labeled regions of pixels.

Feature Extraction
The extraction of the relevant pixel-based features from the input images is achieved by the first part of the artificial network; its structure is based on the ResNet neural network architecture [1], which has proved to be very effective and is widely used. ResNet gained recognition after winning the ImageNet competition and introduced the concept of skip connections between the layers of the neural network. These connections were the novel part of the system and helped to improve the performance of the CNN. In this work, we make use of a modified version, called ResNet-50 (see Figure 2).

Each "conv block" is characterized by the following three operations: a 2D convolution, a batch normalization operation, and ReLU activation. These are performed three times, with a different number of filters and kernel sizes. This block is then combined with the result of running another 2D convolution. The "identity block" is similar, meaning that it has the same three operations, also performed three times, but the skip connection is performed with the input tensor, rather than having the extra convolution at the end.
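As an illustration of the two block types described above, the following is a minimal Keras sketch; the framework choice, filter counts, and kernel sizes are assumptions for illustration, not the exact configuration used in the paper:

```python
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size):
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def conv_block(x, filters):
    # three conv-BN-ReLU stages with different filter counts and kernel sizes
    y = conv_bn_relu(x, filters, 1)
    y = conv_bn_relu(y, filters, 3)
    y = conv_bn_relu(y, 4 * filters, 1)
    shortcut = layers.Conv2D(4 * filters, 1)(x)   # extra convolution on the skip path
    return layers.Add()([y, shortcut])

def identity_block(x, filters):
    # same three stages, but the skip connection uses the input tensor directly
    y = conv_bn_relu(x, filters, 1)
    y = conv_bn_relu(y, filters, 3)
    y = conv_bn_relu(y, 4 * filters, 1)
    return layers.Add()([y, x])
```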

Decoder Structure
The part responsible for decoding the extracted relevant features is the decoder. This is based on the well-known U-Net architecture [4], which is able to perform well even with small training datasets. The structure of the decoder is illustrated in Figure 3. The concatenation operations are paired with the corresponding layers from the encoder part, whereas the output of the network is given by the final convolution operation. The decoder makes use of multiple operations that are performed in order to obtain the desired output.
Our module features a central convolutional layer, followed by three upsampling layers. The central layer contains a 2D convolution operation and a batch normalization, followed by ReLU activation. The three upsampling layers have the following structure: 2D upsampling, concatenation (with the corresponding encoder layer), zero padding, 2D convolution, and a batch normalization operation. The final part of the semantic segmentation module has an additional convolution that represents the final segmentation output (the segmentation map), with the same number of channels as the desired number of predicted object classes.
The encoder-decoder structure of the artificial neural network is presented in Figure 4, where the concatenate connections are more clearly illustrated.
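A minimal sketch of one decoder upsampling layer and of how the three output heads could share the encoder, again in Keras; the helper names, filter counts, and head channel counts are illustrative assumptions rather than the exact architecture:

```python
from tensorflow.keras import layers

def up_block(x, skip, filters):
    x = layers.UpSampling2D(2)(x)
    x = layers.Concatenate()([x, skip])      # concatenation with the corresponding encoder layer
    x = layers.ZeroPadding2D(1)(x)
    x = layers.Conv2D(filters, 3)(x)
    return layers.BatchNormalization()(x)

def decoder_head(features, skips, out_channels):
    x = layers.Conv2D(512, 3, padding="same")(features)   # central convolutional layer
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for skip, filters in zip(skips, (256, 128, 64)):       # three upsampling layers
        x = up_block(x, skip, filters)
    return layers.Conv2D(out_channels, 1, activation="sigmoid")(x)

# The three heads reuse the same encoder features and skip connections:
# quarters  = decoder_head(features, skips, 4)   # top-left, top-right, bottom-left, bottom-right
# semantics = decoder_head(features, skips, 3)   # road, dynamic objects, static objects
# votes     = decoder_head(features, skips, 3)   # left, right, and combined vote maps
```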

Global Semantic Segmentation
The neural network output (or heads) represents the decoder part that reconstructs the semantic segmentation and is based on the U-Net CNN. The network proposed by us features a center layer and three upsampling layers that are further concatenated with their correlated layers from the ResNet-based encoder, as described in the previous section.
The relevant features, extracted using the ResNet encoder, are used in the reconstruction layers of the U-Net decoder to produce an image with three channels, each channel representing the desired segmentation classes. In this proposed work, we predict three different classes: the road (free space), dynamic objects, and static objects. Therefore, the first output channel of the final convolutional layer will depict the drivable road area, while the second channel will represent the dynamic (moving) objects from the road, including pedestrians, cyclists, vehicles, buses, and trucks. The final channel of the output will be the static objects; more specifically, we chose sidewalks and the lane delimiters (road barriers or fences). An example is presented in Figure 5.
Figure 5. Examples of the global semantic segmentation output: the first image is the color input image of the road scene, while the second image represents the drivable road area, the third represents the dynamic objects, and the fourth image is static objects (sidewalk or fence).

Obstacle Reconstruction Using Part-Based Semantic Segmentation
The detected obstacles from the decoder module will be reconstructed using the methodology presented in this section. The CNN will provide an output wherein the obstacles are split into four individual quarters (parts), each represented by a binary channel of the output image (see Figure 6).

Based on the semantic segmentation results of the neural network, which label each obstacle pixel with a "quarter" label, we have designed an algorithm to extract the individual objects. First, each pixel of the whole image space is labeled with a 4-bit code, each bit corresponding to a quarter that overlaps the pixel. Some pixels can be overlapped by more than one quarter, as the four semantic segmentation images produced by the CNN are not mutually exclusive (see Figure 7). Each of the four bits indicates whether the pixel was labeled as part of a top-left, top-right, bottom-left, or bottom-right quarter.
The coded pixels will have a value of 0 if they belong to the free space (they are not obstacle points), or a value between 1 and 15 if they belong to an object. For each of the 15 obstacle pixel values, a binary image will be generated, and the image will be labeled using the connected component labeling algorithm. The labels for each of the 15 images will be unique (the labels for image 2 will start from the maximum value of the labels of image 1, plus 1, and so on), meaning that at the end, they can be joined together into a common label image, as shown in Figure 8. The resulting regions are similar to the superpixels described in [11], but they have the added advantage that they are the result of semantic segmentation, so there is a high probability that each region belongs to a single object; they are grouped by meaning and not simply by color or texture properties, as in the case of superpixels. The problem is now re-stated as the problem of assigning a unique object identifier to each region.
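The coding and labeling step can be sketched as follows; this is a rough illustration that assumes the four quarter channels arrive as binary NumPy masks, and the bit order TL = 1, TR = 2, BL = 4, BR = 8 is an assumption made for the example:

```python
import numpy as np
from scipy.ndimage import label

def build_region_labels(tl, tr, bl, br):
    """tl, tr, bl, br: binary (H, W) masks of the four quarter classes."""
    code = (tl.astype(np.uint8) * 1 + tr.astype(np.uint8) * 2 +
            bl.astype(np.uint8) * 4 + br.astype(np.uint8) * 8)
    regions = np.zeros(code.shape, dtype=np.int32)        # 0 = free space
    next_label = 1
    for value in range(1, 16):                            # the 15 possible obstacle codes
        mask = code == value
        if not mask.any():
            continue
        labeled, count = label(mask)                      # connected component labeling
        regions[mask] = labeled[mask] + (next_label - 1)  # offset keeps labels unique across codes
        next_label += count
    return code, regions
```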
Basically, the problem becomes a problem of region-grouping but, again, we can use the advantage of semantics, because we can restrict the position of the region inside the whole object, based on the quarter codes.
The grouping of the regions will be based on generating multiple object hypotheses from the individual quarters. Because we have four types of quarters, we can generate a maximum of four hypotheses for each real-world object. Each quarter will generate a complete object by extending its size to cover the other three missing quarters, assuming that the quarters will be of a similar size. For example, a bottom-left quarter can be extended upwards and to the right to generate the possible complete object to which the quarter belongs. If the object is completely visible (not covered by another object and not at the edge of the image), four complete rectangular hypotheses will be generated, as seen in Figure 9.
Due to the fact that the hypotheses outnumber the objects by a factor of 4 to 1, the algorithm will compute a pixel fitness score for all hypotheses, so that the best-fitting one can be selected. The pixel score S(R) for each region R is computed by counting the pixels overlapping the region that fit with their corresponding quarter: the quarter defined by the region must correspond to the quarter label assigned to the pixel by the CNN-based semantic segmentation. When considering a matched pixel, we will consider the best-suited segmentation quarter label, so if a pixel has two labels (for example, a top-left label and a bottom-left label) and belongs to a top-left quarter of the hypothesis region, it is considered to be a match. This process is depicted in Figure 10.
The pixel score is then normalized with the region area, so all pixel scores S(R) will belong within the interval (0…1).
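A rough sketch of the hypothesis generation and of the pixel fitness score S(R), under the same bit-order assumption as above; the helper names and the simplified border handling are illustrative:

```python
import numpy as np

QUARTER_BITS = {"tl": 1, "tr": 2, "bl": 4, "br": 8}   # assumed bit order

def hypothesis_from_quarter(x, y, w, h, quarter):
    """Extend a quarter bounding box (x, y, w, h) into a full-object rectangle,
    assuming the four quarters have roughly the same size."""
    if quarter == "tl":
        return (x, y, 2 * w, 2 * h)
    if quarter == "tr":
        return (x - w, y, 2 * w, 2 * h)
    if quarter == "bl":
        return (x, y - h, 2 * w, 2 * h)
    return (x - w, y - h, 2 * w, 2 * h)                # "br"

def pixel_fitness(code, rect):
    """Count pixels whose CNN quarter label matches the hypothesis quarter they fall into,
    normalized by the rectangle area."""
    x, y, w, h = rect
    hw, hh = w // 2, h // 2
    sub_rects = {"tl": (x, y), "tr": (x + hw, y), "bl": (x, y + hh), "br": (x + hw, y + hh)}
    score = 0
    for quarter, (sx, sy) in sub_rects.items():
        patch = code[max(sy, 0):sy + hh, max(sx, 0):sx + hw]
        score += np.count_nonzero(patch & QUARTER_BITS[quarter])
    return score / float(w * h)                        # normalized to (0...1)
```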
The next step of the algorithm is to establish the dependency relationships between the hypotheses. An overlapping score is computed between any two rectangles Ri and Rj. If the overlapping score exceeds a threshold of 0.5, and the pixel score of Rj is higher than the pixel score of Ri, the rectangle Ri will be labeled as being dependent on Rj. Basically, we assume that the region Ri depicts the same object as Rj but is a less adequate fit.
This process is depicted in Figure 11. The four rectangles are generated by the pixel-derived quarters, but the best fit is the rectangle generated by the top-right quarter. Therefore, all the other rectangles are labeled as being dependent on the rectangle hypothesis generated by the top-right quarter.
Figure 11. Re-labeling the rectangle hypothesis by the best-scored overlap.
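The dependency labeling can be sketched as follows; the overlap score is assumed here to be an intersection-over-union ratio, which the text does not specify exactly, and the 0.5 threshold follows the description above:

```python
def overlap(a, b):
    # intersection-over-union of two rectangles (x, y, w, h); an assumed overlap measure
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / float(union) if union else 0.0

def label_dependencies(rects, scores):
    """Returns parent[i]: the index of the better-scored rectangle that Ri depends on."""
    parent = list(range(len(rects)))
    for i, ri in enumerate(rects):
        for j, rj in enumerate(rects):
            if i != j and overlap(ri, rj) > 0.5 and scores[j] > scores[parent[i]]:
                parent[i] = j
    return parent
```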
Based on the rectangle hypotheses and their dependency relations, the individual regions presented in Figure 8 will receive the final label. This new label is the identity of the individual object of which the region is a part.
For each region, the number of pixels overlapping every rectangle hypothesis is counted. All rectangles are considered in this step, even those that are in a dependency relationship. The process has two stages:
1. The rectangle that overlaps the most pixels of the individual region is selected.
2. If the rectangle is dependent on another rectangle (as seen in Figure 11), the label of the main rectangle is transferred to the individual region.
The final step of the algorithm is to label the pixels themselves with the identity of the object to which they belong. For this step, we simply transfer to each pixel the label of the region to which it belongs, as this region has already been labeled with the identity of the object.
The entire process is illustrated in Figure 12, where two partially overlapping objects are presented, together with their final label. The complete processing steps are described in Algorithm 1.
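Since Algorithm 1 itself is not reproduced here, the following sketch illustrates the two-stage assignment just described, assuming the `regions` image, the hypothesis rectangles, their pixel scores, and the dependency links (`parent`) from the previous sketches:

```python
import numpy as np

def pixels_in_rect(regions, region_id, rect):
    x, y, w, h = rect
    patch = regions[max(y, 0):y + h, max(x, 0):x + w]
    return np.count_nonzero(patch == region_id)

def assign_regions_to_objects(regions, rects, parent):
    """Returns a map region_id -> object id (the index of the main, best-fitted rectangle)."""
    assignment = {}
    for region_id in np.unique(regions):
        if region_id == 0:                    # free space
            continue
        # 1. pick the rectangle overlapping the most pixels of this region
        best = max(range(len(rects)),
                   key=lambda i: pixels_in_rect(regions, region_id, rects[i]))
        # 2. follow the dependency link to the main rectangle and transfer its label
        while parent[best] != best:
            best = parent[best]
        assignment[region_id] = best
    return assignment
```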

Object Refinement
We provide an additional post-processing step in order to improve the quality of the detected objects. The dynamic objects from the global segmentation head (described in Section 3.3) are used to improve the four quarters from the object detection head. Therefore, the semantic output is used to refine the edges of the detected quarters. The process is depicted in Figure 13, where the global semantic segmentation dynamic objects (column 2) are combined with the quarters output (column 3).


Figure 13. Fusing the semantic output with the quarters to improve object detection.
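One plausible way to implement this fusion is a simple per-pixel intersection between the dynamic-object channel and each quarter channel; this is only a sketch of the idea, as the exact fusion rule is not detailed above:

```python
import numpy as np

def refine_quarters(dynamic_mask, quarter_masks):
    """Keep only the quarter pixels that the global head also labels as dynamic objects."""
    return [np.logical_and(q, dynamic_mask) for q in quarter_masks]
```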

Vanishing-Point Computation
Vanishing-point computation methods have been presented in the literature. Most methods are based either on line intersections or feature analysis, or use artificial networks to predict the 2D point coordinates of the vanishing point [33]. In this work, we make use of the results of our previously published work [34] to produce the required training images for the neural network. Our previous paper presented an algorithm for detecting the vanishing point by computing the orientation and magnitude of the gradient, which are used to generate three vote-map images. The first image contains the vote map of the features from the left side of the input image, the second image has the vote-map features from the right side of the input image, and the third image is the multiplication result of the first two images (the left and right vote maps). Figure 14 presents an example of the three vote-map images. The voting maps can be computed using classic image-processing techniques (by computing the orientation of the gradients) or can be extracted directly from a fully convolutional CNN, such as the one proposed in this paper.
The coordinates of the vanishing point can be computed using a sliding window on the third image, or by extracting the maximum from the third vote-map image. Extracting the maximum takes an additional 0.09 ms on average to compute; therefore, we used this version. Extracting the VP coordinates can be achieved directly in an end-to-end manner via a CNN with a different architecture; however, in our case, we preferred to leverage the shared layers from the encoder, in order to produce the three vote-map images using semantic segmentation and then extract the coordinates from the third output layer. The other two predicted layers (the left and right vote maps) can be used as input for a lane detection system on marked roads, or as an additional step to validate the extracted vanishing-point coordinates.
Computing the vanishing point is relevant, due to the fact that it can be used to compute the extrinsic camera parameters if we assume some geometric constraints (a flat road assumption and a small lens distortion). The pitch and yaw angles of the camera with respect to the world can be computed if the focal distance is also known (along with the image size), as we presented in [33].
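A small sketch of extracting the vanishing point from the third (combined) vote map and of deriving the pitch and yaw angles under the flat-road assumption; placing the principal point at the image center and the sign conventions are assumptions made for the example:

```python
import numpy as np

def vanishing_point_and_angles(combined_vote_map, focal_px):
    h, w = combined_vote_map.shape
    vy, vx = np.unravel_index(np.argmax(combined_vote_map), combined_vote_map.shape)
    cx, cy = w / 2.0, h / 2.0
    yaw = np.arctan((vx - cx) / focal_px)     # horizontal offset of the VP -> yaw
    pitch = np.arctan((cy - vy) / focal_px)   # vertical offset of the VP -> pitch
    return (vx, vy), pitch, yaw
```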

Training the Multi-Output CNN
Multi-task learning requires special datasets and databases for training. For the proposed solution, we have used three well-known datasets: CityScapes [35], Berkeley Deep Drive (BDD) [36], and Mapillary [37]. All the images were processed to have the same size and aspect ratio; we also filtered the images to select only those that feature a large number of road pixels. After this selection, the number of images used for training was as follows: 2759 images were taken from BDD, 2975 images from CityScapes, and 17,109 images from Mapillary. These databases contain relevant information for semantic segmentation and the bounding boxes of the obstacle objects. Using this data, we could extract the four obstacle quarters. The input images were split into four quarters, using the available bounding boxes of the objects. The next step consisted of masking each quarter with the semantic segmentation maps, in order to generate the top-left, top-right, bottom-left, and bottom-right binary images of the individual object instances. The process is presented in Figure 15.
The prepared images from the datasets were also augmented during training. We used random intensity and saturation adjustments in the HSV color space, as well as random image scaling and translation.
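A rough sketch of how the quarter ground truth can be generated from a dataset's bounding boxes and obstacle masks, as described above; the function and variable names, and the exact masking policy, are illustrative assumptions:

```python
import numpy as np

def quarter_ground_truth(obstacle_mask, boxes):
    """obstacle_mask: binary (H, W) mask of obstacle pixels; boxes: list of (x, y, w, h)."""
    tl, tr, bl, br = (np.zeros_like(obstacle_mask) for _ in range(4))
    for (x, y, w, h) in boxes:
        hw, hh = w // 2, h // 2
        # split the bounding box into four quarters, then mask with the obstacle pixels
        tl[y:y + hh, x:x + hw] |= obstacle_mask[y:y + hh, x:x + hw]
        tr[y:y + hh, x + hw:x + w] |= obstacle_mask[y:y + hh, x + hw:x + w]
        bl[y + hh:y + h, x:x + hw] |= obstacle_mask[y + hh:y + h, x:x + hw]
        br[y + hh:y + h, x + hw:x + w] |= obstacle_mask[y + hh:y + h, x + hw:x + w]
    return tl, tr, bl, br
```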
The semantic segmentation module used binary cross-entropy and the Sorensen-Dice [38] loss function, which is a modified version of the intersection-over-union loss. The loss function for the obstacle detection was the same. The vanishing-point module also used the same loss function as the semantic segmentation; the generation of its training data is described in Section 3.6. During the training process, each loss function can be configured with a different weight. We used the same weight for all three loss functions, having experimented with various settings.
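A minimal sketch of the combined binary cross-entropy and Sorensen-Dice loss used for the heads; writing it with TensorFlow/Keras operations is an assumption about the framework, and the smoothing constant is illustrative:

```python
import tensorflow as tf

def bce_dice_loss(y_true, y_pred, smooth=1.0):
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return bce + (1.0 - dice)   # Dice acts as a soft intersection-over-union term
```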
Training the neural network was performed for a total of 500 epochs; it was executed on a system equipped with two Nvidia 1080 Ti GPUs, with a total of 22 GB of memory. The patience parameter was set to 50 epochs; therefore, if the loss function did not improve for 50 epochs, the training process was stopped early. With this hardware, training one epoch took around 300 s, which means that a fully working multi-output CNN model can be obtained in less than 10 h (with early stopping). The ResNet-50 encoder was initialized with ResNet-50 weights pre-trained on ImageNet, which helped speed up the training process. We also tried freezing some layers during training, but this did not improve the final results.

Results and Evaluation
The same hardware setup was used for evaluating as well as training the model. The experimental setup is based on the Python programming language, the project being implemented using open-source deep learning software. The proposed multiple output network was evaluated, using not only existing popular datasets (as mentioned in Section 4) but also our own datasets [17,33].
For the semantic segmentation output head, the evaluation was performed using the CityScapes validation dataset; the results were published in our previous work [16], where we used the free road space in order to detect on-road obstacles and the segmented road surface was integrated into our monocular perception system, which is able to track the detected obstacles using a particle filter. For the road class, we obtained an IoU score of 0.90.
The obstacle detection output head that predicts the obstacle quarters provides the main information used for detecting the obstacles in the road-traffic scene. Therefore, the obstacle detection module was also evaluated using the dataset presented in [17], which features road-traffic images captured with different camera systems from the ones used in the training datasets. For this dataset, the ground truth is considered to be the information obtained from a stereo-vision camera setup. We compared our results with our previously proposed system and previously published results in [39,40]. On our own dataset, which features over 1000 images of object bounding boxes extracted from the stereo-vision data, we obtained an IoU score of 0.74. On the CityScapes validation dataset, we obtained an IoU score of 0.83 and, on the KITTI [41] dataset, a 0.70 IoU score. The results were also compared with a Yolo V3 [5] detector featuring a DarkNet backbone, which was trained on KITTI using the same loss as the initial Yolo paper. We also compared the results with a Yolo V3 detector with the same ResNet-50 backbone, trained on the same datasets as our proposed model. We have also compared our results with the bounding boxes from Mask R-CNN and Yolact; the results are similar to ours, with minor differences (1-2% in favor of Mask R-CNN or Yolact). The bounding box obstacle detection results are presented in Table 1.
The stereo-vision database from Table 1 contains images acquired with a pair of cameras, of which we used only the left image. The ground truth data was considered to be the result of a stereo-tracking algorithm [42]. The main advantage of a stereo-vision setup is that overlapping objects can be easily detected based on their depth information. Still, our proposed system, using a monocular camera setup, obtained high accuracy. The evaluation on complex city scenarios showed high accuracy, especially on the CityScapes dataset. When testing on the KITTI database, our results were poorer, due to the different aspect ratio of the dataset images. To address this issue, we had two possibilities: we could either resize the KITTI image to fit the 256 × 256 pixel input of the CNN, or crop the KITTI image and then resize it. Both variants would affect our evaluation results; the first choice would produce a severe deformation of the objects in the traffic scene, which would strongly impact the detection process, whereas the second choice (cropping the input image) would most likely exclude various objects in the scene from detection. For this evaluation, we used the first option.
For the 2D bounding box (object detection) evaluation, we obtained a marginal improvement (1%) over previously published work [40], due to the fusion of the output heads (as described in Section 3.5).
The pixel-wise evaluation of the detection was also performed. We evaluated the instance segmentation on the CityScapes dataset and compared the results with Mask R-CNN and Yolact, tested on the same test data from the dataset. The results are presented in Table 2. We also present some of the qualitative results of our proposed network, versus Yolact and Mask R-CNN, in Figure 16. Mask R-CNN extracts well-defined, more refined obstacle edges but, in some cases, it may miss objects (as can be seen in the second row of Figure 16). Yolact predictions sometimes miss objects from the scene, as seen in the second and third rows of Figure 16; the Yolact solution also seems to predict false positives (fourth row of Figure 16).
Figure 16. Comparison between Mask R-CNN, Yolact, and our method: the first column represents the input image, the second column shows the ground truth instances, the third column shows the Mask R-CNN results, the fourth column shows the Yolact results, and the fifth column represents our results.
The popular Mask R-CNN network has a total of 64 million parameters, while Yolact has 50 million parameters; Yolo V3, with a DarkNet backbone, features 41 million parameters, while our U-Net- and ResNet-based model features 32 million parameters in total for the three output heads (semantic segmentation, vanishing point, and obstacle quarters). The total prediction time for all three outputs was 0.043 s on average, whereas for Mask R-CNN, the prediction time was 0.23 s and, for Yolo V3, it was 0.052 s. The Yolact network predicted the output in 0.041 s. If we removed the vanishing-point output head, the network was reduced to 24 million parameters and the prediction time dropped to 0.034 s. If we further removed the semantic segmentation output head, we ended up with a 16-million-parameter network with a prediction time of 0.026 s.
All the results are presented in Table 3 and the tests were performed on a desktop system, equipped with an Intel i7 CPU and two Nvidia 1080 Ti GPU graphic boards that were used for both training and prediction (evaluation). The prediction time for all three output heads was similar to the one from Yolo V3, which only features obstacle detection (0.057 vs. 0.052 s, on average, on the same input test images). The total computational time for our system was higher, due to the extra post-processing required for the obstacle quarter grouping, labeling, and, finally, bounding box extraction (0.014 s); however, these steps were executed on a single CPU core and were not optimized. Nevertheless, the processing times were comparable.
The prediction and post-processing results of the proposed system, in various scenarios, are illustrated in Figure 17. A video with the results on the stereo-vision dataset is available at: https://vimeo.com/694007992 (accessed on 20 April 2022).
Figure 17. The first column represents the input image, while the second column is the quarter prediction (with the CNN output merged), the third column represents the labeling output, and the fourth column is the resulting bounding boxes of the individual obstacles.
The third output head represents the vanishing-point prediction. The vanishing-point coordinates are extracted from the CNN-predicted mask, which takes an additional 0.09 ms on average. The results are presented in Table 4, as already published in our previous work [39]. The evaluation presented in Table 4 was performed using the CityScapes test set and the dataset from [33]. The metric used was NormDist, which represents the RMSE pixel error, divided by the image diagonal.
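For reference, the per-image NormDist value described above can be written as below, where $(x_e, y_e)$ and $(x_g, y_g)$ denote the estimated and ground-truth vanishing-point coordinates and $W$, $H$ the image dimensions; this is a sketch of the definition, with the per-image values then aggregated over the test set as an RMSE:

$$\mathrm{NormDist} = \frac{\sqrt{(x_e - x_g)^2 + (y_e - y_g)^2}}{\sqrt{W^2 + H^2}}$$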

Conclusions
In this paper, we propose a solution that is capable of accurately detecting on-road obstacles and their actual instances in various road-traffic scenarios. We achieved this by leveraging encoder-decoder-based artificial networks and geometry-based computer vision algorithms. The information about object parts was extracted using semantic segmentation networks and then used in a low-complexity clustering algorithm. Therefore, our proposed solution is capable of detecting individual objects, even if they are partly occluded or are in close contact. The multi-output CNN model, together with the post-processing algorithm, represents a different approach to the traditional object segmentation and detection problem. The system is also lighter in terms of computational demand and is easier to train, while also having an accuracy comparable with much more complex artificial neural networks.
We have presented a unique approach to detecting object instances, along with the semantic segmentation of the scene. In this approach, we fuse the two prediction outputs to obtain better object instances; we also predict information regarding the vanishing point, which can later be used to compute the extrinsic camera parameters. In conclusion, the proposed approach features a lower number of parameters than similar published work, has similar or better performance than previous approaches when evaluating instances and the 2D bounding boxes, and is easy to train and deploy. The proposed solution can be used in real-time systems with a single camera to predict individual obstacle instances in road-traffic scenarios, while also predicting the vanishing point, which can be used for the self-calibration of the camera. Our proposed solution has been evaluated using well-established performance indicators on publicly available datasets and our own acquired database.
Future work will include using these results along with a tracker, in order to estimate the trajectories of the road participants. The idea is to make use of multiple consecutive frames, in order to better determine partly or fully occluded objects from the road scene. This future work will also include exploring other prediction outputs for the CNN model, such as depth or optical flow.