Multi-Scale and Occlusion Aware Network for Vehicle Detection and Segmentation on UAV Aerial Images

Abstract: With the advantage of high maneuverability, Unmanned Aerial Vehicles (UAVs) have been widely deployed in vehicle monitoring and controlling. However, processing the images captured by UAVs to extract vehicle information is hindered by challenges including arbitrary orientations, huge scale variations and partial occlusion. To address these challenges, we propose a novel Multi-Scale and Occlusion Aware Network (MSOA-Net) for UAV based vehicle segmentation, which consists of two parts: a Multi-Scale Feature Adaptive Fusion Network (MSFAF-Net) and a Regional Attention based Triple Head Network (RATH-Net). In MSFAF-Net, a self-adaptive feature fusion module is proposed, which adaptively aggregates hierarchical feature maps from multiple levels to help the Feature Pyramid Network (FPN) deal with the scale change of vehicles. The RATH-Net with a self-attention mechanism is proposed to guide the location-sensitive sub-networks to enhance the vehicle of interest and suppress background noise caused by occlusions. In this study, we release a large comprehensive UAV based vehicle segmentation dataset (UVSD), which is the first public dataset for UAV based vehicle detection and segmentation. Experiments are conducted on the challenging UVSD dataset. Experimental results show that the proposed method is efficient in detecting and segmenting vehicles, and outperforms the compared state-of-the-art works.


Introduction
With the advantage of high maneuverability, Unmanned Aerial Vehicles (UAVs) have been widely used in traffic monitoring and controlling [1]. For some UAV based systems, detection of vehicles is often the first challenging process [2]. Compared with common scenarios, processing images captured by UAVs for accurate and robust vehicle detection is hindered by a multitude of challenges. The main challenges are analyzed as follows.
1. Arbitrary orientations: Vehicles in images captured by UAVs often appear with arbitrary orientations due to viewpoint and height changes.
2. Huge scale variations: With the wide range of UAV cruising altitudes, the scale of captured vehicles changes greatly.
3. Partial occlusion: Because vehicles have similar structures and colors in some scenarios, it is hard to separate vehicles that are crowded together or partially occlude each other.
These differences between images captured by UAVs and the images in regular datasets (e.g., Pascal VOC [3] and Microsoft COCO [4]) make it challenging to detect and segment objects in images captured by UAVs.
Previous detection methods are mostly based on horizontal bounding box (HBB), including Faster region-based convolutional networks (Faster R-CNN) [5], Grid R-CNN [6], Light-Head R-CNN [7], You Only Look Once (YOLO) [8], Fully Convolutional One-Stage Object Detection (FCOS) [9], etc. These HBB methods usually contain a lot of background pixels when detecting vehicles with arbitrary orientations, which is shown in Figure 1a.
To better detect objects with arbitrary orientations, some oriented bounding box (OBB) based detection methods have been proposed, including Rotational Region Convolutional Neural Network (R2CNN) [10], Rotation Region Proposal Network (RRPN) [11], Rotation Dense Feature Pyramid Networks (R-DFPN) [12], Faster R-CNN trained on OBB (FR-O) [13], etc. These methods can partly reduce background pixels compared with HBB based methods. Although the OBB based methods perform better when detecting oriented objects in remote sensing images, they still include some background pixels when detecting vehicles captured by UAVs at low altitude, as shown in Figure 1b.
Based on these analyses, we argue that a mask segmentation process can overcome the problems of HBB and OBB based methods when detecting vehicles with arbitrary orientations. The comparison results of the HBB, OBB and mask based methods are shown in Figure 1. The HBB and OBB based methods contain a large proportion of background pixels, whereas the mask based methods contain only the vehicle regions. Segmentation can provide accurate vehicle regions for subsequent vision-based tasks in traffic monitoring systems, such as vehicle re-identification (ReID). Vehicle segmentation can improve re-identification performance by solving the problem of cluttered background [14]. Beyond that, after obtaining the mask of the vehicle, the OBB of the vehicle can be obtained by using the minimum enclosing rectangle method [15]. Yet, because of the gap between segmentation for vehicles captured by UAVs and segmentation for general objects, new models need to be developed to address the special problems of UAV-captured vehicles, including huge scale variations and partial occlusion. Accordingly, we propose a novel Multi-Scale and Occlusion Aware Network (MSOA-Net) for UAV based vehicle segmentation, which consists of two parts: a Multi-Scale Feature Adaptive Fusion Network (MSFAF-Net) and a Regional Attention based Triple Head Network (RATH-Net). The MSFAF-Net is proposed to better deal with the large scale change of vehicles. An effective way to cope with the scale change of vehicles is to utilize the multi-scale features of the middle layers of the backbone network.
Several methods have been proposed to deal with features at different scales. FPN [16] proposes a top-down pathway to enrich the semantic information of shallow layers. PANet [17] designs a bottom-up pathway to shorten the information path between lower layers and topmost features. MFPN [18] inherits the merits of different FPNs by assembling three kinds of FPNs: top-down FPN, bottom-up FPN and fusing-splitting FPN. Yet, these models cannot learn self-adaptive weights according to the importance of features at different scales. Distinguished from these methods, we design a self-adaptive feature fusion module that measures the importance of features at different scales with a learned weight vector and aggregates these features with self-adaptive weights. After that, we use the aggregated features to enhance the original features of FPN. In this manner, we shorten the information path between features from different levels and obtain multi-scale features with a small semantic gap between levels.
The RATH-Net is proposed to handle background noise caused by occlusion. From the perspective of UAV at low altitudes, occlusion happens very often, which is challenging to handle because of similar structures and colors. Traditional approaches aim at merely narrowing the gap between the predicted bounding box or mask and its designated ground-truth [5,19,20]. In this paper, we propose an effective way to suppress occlusion. We design a Regional Attention Module (RAM) to guide the regression branch and the mask branch to pay more attention to the current vehicle (foreground) and suppress occlusion caused by other vehicles (background) of similar structure or color.
Because a standardized public dataset is lacking, we release a public dataset called UAV based vehicle segmentation dataset (UVSD) with 5874 images and 98,600 vehicles, which is the first public detection and segmentation dataset for UAV-captured vehicles with different attitudes, different altitudes and occlusions. Experiments are conducted on the challenging UVSD dataset. The results show that the proposed MSOA-Net is efficient in detecting and segmenting vehicles, and outperforms the compared state-of-the-art methods.
The main contributions of this paper are summarized as follows.
1. The innovative MSOA-Net segmentation structure is proposed for addressing the special problems of UAV based vehicle detection and segmentation.
2. The multi-scale feature adaptive fusion network is proposed to adaptively integrate the low-level location information and high-level semantic information to better deal with scale change.
3. The regional attention based triple head network is proposed to better focus on the region of interest, reducing the influence of occlusions.
4. The new large comprehensive dataset called UVSD is released, which is the first public detection and segmentation dataset for UAV-captured vehicles.
This paper is organized as follows. Section 2 gives a brief introduction to the related work. Section 3 describes the released dataset in detail. Section 4 presents the detailed descriptions of MSOA-Net. Section 5 shows the evaluation and comparison results, and Section 6 further discusses the experimental results. Finally, Section 7 gives the conclusion and future work.

Related Work
In this section, we first briefly review the generic object instance segmentation methods. Then, we introduce some UAV-based datasets.

Generic Object Instance Segmentation
Object instance segmentation methods aim to locate and segment accurate regions of objects, and can be divided into two groups: two-stage instance segmentation and one-stage instance segmentation. Mask R-CNN [19] follows the idea of a two-stage object detection method, and adds a mask prediction branch on the basis of Faster R-CNN [5]. Based on Mask R-CNN, PANet [17] introduces a bottom-up path augmentation structure to make full use of shallow network features for instance segmentation; Mask Scoring R-CNN [21] adds a new branch that scores the mask in order to predict a more accurate mask. These two-stage methods usually achieve more accurate results than one-stage methods.
One-stage instance segmentation methods are mainly inspired by one-stage detectors. TensorMask [22] uses a dense sliding window method to segment each pixel instance by a preset number and size of sliding windows. You Only Look At CoefficienTs (YOLACT) [23] divides the instance segmentation task into two parallel sub-tasks: one branch generates a series of prototype masks; the other branch predicts the corresponding mask coefficients of each instance. Segmenting Objects by Locations (SOLO) [24] transforms the segmentation problem into a classification problem by predicting the instance category of each pixel. In this way, instance segmentation can be implemented directly under the supervision of the instance mask annotation without bounding box regression. The method in [25] works in a polar coordinate system and formulates the instance segmentation problem as predicting the contour of an instance through instance center classification and dense distance regression. CenterMask [26] adds a head network for mask generation to a one-stage object detection algorithm (FCOS [9]) to complete the instance segmentation task. EmbedMask [27] calculates embeddings for each proposal and pixel, and determines whether a pixel belongs to the object within a proposal according to the embedding distance between the pixel and the proposal. Although these methods have the advantage of high speed over the two-stage methods, they often fail to match the accuracy of the two-stage methods.
However, those algorithms are designed for general objects, not directly for vehicle segmentation in UAV aerial images.

UAV-Based Datasets
With the increasing application of UAV platforms in various industries [28], more and more UAV-based datasets have been proposed in the computer vision field. Bozcan et al. [29] build a multimodal UAV dataset called AU-AIR for low altitude traffic surveillance. It has multimodal sensor data collected in real outdoor environments. Hsieh et al. [30] propose a car parking lot dataset called CARPK for car detection and counting, which contains 89,777 cars collected by UAVs from different parking lots. Robicquet et al. [31] propose a large dataset called the Stanford Drone dataset for trajectory forecasting and multi-target tracking, which consists of more than 19,000 targets collected by a drone platform from university campus scenes and sidewalks on busy streets. Mueller et al. [32] propose a video dataset called UAV123, which is used for target tracking. The dataset contains 123 video sequences and more than 110,000 frames captured by low-altitude UAVs. Du et al. [33] propose a UAV-based benchmark for object detection and tracking, which contains about 80,000 images with bounding box annotations. Barekatain et al. [34] establish a video dataset called the Okutama-Action dataset, which can be used for pedestrian detection, spatiotemporal action detection and multi-person tracking. It consists of a 43 min sequence with full annotation and 12 action categories. Li et al. [35] construct a drone tracking dataset, which contains 70 video sequences with manual annotation. Some of the videos were recorded by UAVs on a university campus, and others were collected from YouTube. Zhu et al. [2] propose a large-scale visual object detection and tracking benchmark dataset called VisDrone2018. It consists of 263 videos and 10,209 images collected by UAV platforms in different cities of China.
These datasets are for different computer vision tasks including detection, tracking, counting, etc. Yet as far as we can find in the literature, there is no reported UAV-based vehicle segmentation related dataset.

Dataset Description
In this study, we build and release a new large-scale dataset called UAV based vehicle segmentation dataset (UVSD). The UVSD has been released at https://github.com/liuchunsense/UVSD. UVSD contains 5874 images with 98,600 instances with high quality instance-level semantic annotations. To ensure high quality, the annotation process was performed iteratively with a three-level quality check, overall taking about two man-hours per image. The vehicle samples and marked vehicle samples in our dataset are shown in Figure 2. Images in the first row are vehicle samples, and images in the second row are labeled vehicles.

Dataset Properties
Compared with other representative datasets for instance segmentation, the main features of UVSD are as follows:
1. The proposed dataset is extremely challenging. The vehicle instances in UVSD exhibit viewpoint changes, huge scale variations, partial occlusion, dense distribution, illumination variations, etc.
2. The resolution of original images in our dataset ranges from 960 × 540 pixels to 5280 × 2970 pixels, while the resolution of images in regular datasets (e.g., Pascal VOC [3] and Microsoft COCO [4]) is usually less than 1000 × 1000 pixels.
3. There are many images with dense vehicles (more than 150 vehicles per image) in UVSD. Therefore, UVSD can also be used for other vision-based tasks, e.g., vehicle counting.
4. In addition to visual data and pixel-level instance annotations, UVSD includes annotations in other formats (i.e., pixel-level semantic, OBB and HBB annotations). UVSD can therefore also be used for the semantic segmentation task and for HBB and OBB based vehicle detection tasks.

Dataset Collection
In UVSD, 4374 images are captured over urban roads, residential areas, parking lots, highways and campuses in Jinan, China. The airborne platform used in this research is a DJI Matrice 200 quadcopter integrated with a Zenmuse X5S gimbal and camera. The on-board camera can record videos with a resolution of up to 4096 × 2160 pixels at 30 frames per second. The airborne platform is shown in Figure 3. The images in Figure 3 are from the homepage of the DJI Matrice 200 [36]. To collect images covering vehicles of various scales and aspect ratios, UAV images were captured at different flight heights ranging from 10 m to 150 m. At the same time, we constantly adjusted the relative angle between the UAV and the vehicles to obtain images at different orientations, containing vehicles with a wide variety of scales, orientations and shapes.
In order to make the images in the dataset cover as many different scenarios as possible, we carefully select 1500 images from the VisDrone [2] dataset. VisDrone is a large-scale benchmark with annotated bounding boxes collected using various drone platforms. We choose images from different scenarios with various weather and lighting conditions to complement the images we captured. Note that the original VisDrone dataset does not contain instance-level semantic annotations, so we manually label the vehicle masks in these images.

Annotation Principles
In our dataset, labeling a mask of a vehicle needs 34 points on average, which takes at least 17 times more time and energy than that of labeling a bounding box with two points.
When marking a vehicle, the contour of the target is traced accurately with dense, connected points, and a '.JSON' format file is then obtained. The '.JSON' files mainly include the category name, label form, image size, image data and coordinates of each label point.
To ensure the correctness of annotation data, we set some rules for labeling, verification and refinements. The labeling principles are as follows.
1. All clearly visible vehicles that appear in an image need to be labeled using the same software.
2. If the truncation rate of a vehicle exceeds 80%, this vehicle does not need to be labeled and tested.
3. If a vehicle partly appears in an image, we label the mask inside the image and estimate the truncation ratio based on the region inside the image.
4. Images should be zoomed in, when necessary, to obtain annotations with refined boundaries.
5. For each vehicle, we use as many mark points as possible to get the fine edges of the vehicle.
6. Each picture after annotation needs to be reviewed twice by different members of the verification team.
7. In order to protect the privacy of residents, privacy areas such as human faces are blurred, and all image metadata, including device information and GPS locations, are removed.

Proposed Method
We propose a novel deep neural network structure, called Multi-Scale and Occlusion Aware Network (MSOA-Net), for UAV-based vehicle detection and segmentation. MSOA-Net mainly consists of two parts: Multi-Scale Feature Adaptive Fusion Network (MSFAF-Net) and Regional Attention based Triple Head Network (RATH-Net). The workflow chart of the proposed method is shown in Figure 4.
As shown in Figure 4, our MSOA-Net is a two-stage instance segmentation method based on Mask R-CNN [19]. Firstly, given an input image, the multi-scale feature adaptive fusion network adaptively aggregates semantic features at different scales using the self-adaptive feature fusion module. A set of rectangular vehicle proposals at various scales is then generated on the fused multi-scale feature maps using the Region Proposal Network (RPN) [5]. Secondly, after the RoIAlign [19] operation for each proposal, the aligned features are sent to three sub-networks, namely the classification branch, the regional attention-guided regression branch and the regional attention-guided mask branch, which perform classification, bounding-box regression and mask generation, respectively. After these processes, the detection and segmentation results are obtained. The detailed descriptions of the multi-scale feature adaptive fusion network and the regional attention based triple head network are as follows.

Figure 4. The MSOA-Net consists of two parts: the multi-scale feature adaptive fusion network and the regional attention based triple head network. SAFFM is the proposed Self-Adaptive Feature Fusion Module, which is described in Section 4.1.1. RPN is the Region Proposal Network, which is described in Section 4.1.2. RAM is the proposed Regional Attention Module, which is described in Section 4.2.1.

Multi-Scale Feature Adaptive Fusion Network
It is generally accepted that high-level features in backbones carry more semantic information, while low-level features contain more fine details [37][38][39]. Hence, we need to fuse features from multiple layers to make full use of the advantages of features from different levels. FPN [16] designs a top-down pathway to combine multi-scale features. This sequential manner leaves a long path from low-level structure to topmost features [17]. In addition, FPN fuses multi-scale features via a simple summation, ignoring the different importance of different scales [40]. To address these problems, we design a Self-Adaptive Feature Fusion Module (SAFFM) to select the desired features from different scales and integrate these features to enhance the original features of FPN. In this manner, we can more efficiently combine low-level representations with high-level semantic features.
The structure of the multi-scale feature adaptive fusion network is shown in the left of Figure 4. ResNet [41] is used as the backbone network to extract the features of the input image. The feed-forward computation of ResNet is implemented through a bottom-up pathway. In order to build a feature pyramid with multi-scale feature maps, we use the ResNet activations of each stage's last residual block, which are denoted as conv2 ($C_2$), conv3 ($C_3$), conv4 ($C_4$) and conv5 ($C_5$). The conv1 ($C_1$) is not included in the pyramid process. In the top-down pathway, the feature maps are upsampled by a factor of 2. Lateral connections merge the upsampled map with the corresponding bottom-up map. After these processes, the generated feature maps are denoted as $\{P_2, P_3, P_4, P_5\}$.
The multi-level features $\{P_2, P_3, P_4, P_5\}$ are rescaled to the same size as $P_4$, denoted as

$$R_i = \mathrm{Rescale}(P_i), \quad i = 2, 3, 4, 5,$$

where Rescale is usually a bilinear interpolation or adaptive average pooling operation for resolution matching. Then these features are sent to the SAFFM, which is described as follows.

Self-Adaptive Feature Fusion Module
The Self-Adaptive Feature Fusion Module (SAFFM) is designed to measure the importance of features at different scales and integrate features at different scales according to the learned weights.
The structure of SAFFM is shown in Figure 5.
Formally, given feature maps $R_i \in \mathbb{R}^{H \times W \times C}$, $i = 2, 3, 4, 5$, we first aggregate them via concatenation:

$$R_c = \mathrm{cat}(R_2, R_3, R_4, R_5),$$

where cat(·) represents the concatenation operation along the channel dimension and $R_c \in \mathbb{R}^{H \times W \times 4C}$. Then, Global Average Pooling (GAP) is performed on $R_c$ to get the global information $R_{cp} \in \mathbb{R}^{1 \times 1 \times 4C}$. The $j$-th channel of $R_{cp}$ is calculated by

$$R_{cp}(j) = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} R_c(x, y, j),$$

where $R_c(x, y, j)$ represents the pixel value at position $(x, y)$ of the $j$-th channel of $R_c$. We add 1 × 1 convolutional layers and introduce a gating mechanism to further capture channel-wise dependencies between features at different scales. We use a sigmoid function to generate the channel-focus weights:

$$S = \sigma\big(W_2\, \delta(W_1 R_{cp})\big),$$

where $\sigma(\cdot)$ denotes the sigmoid function, $\delta(\cdot)$ refers to the rectified linear unit (ReLU), $W_1 \in \mathbb{R}^{\frac{C}{4} \times 4C}$ and $W_2 \in \mathbb{R}^{4C \times \frac{C}{4}}$ are the parameters of the two 1 × 1 convolutional layers, and $S \in \mathbb{R}^{1 \times 1 \times 4C}$. Then we split the channel-focus weights $S$ into one part per scale, denoted as $S_i \in \mathbb{R}^{1 \times 1 \times C}$, $i = 2, 3, 4, 5$. After that, we combine the channel-focus weights of each scale with the rescaled features $R_i$ using channel-wise multiplication. Last, element-wise summation is used to integrate the re-weighted features and get the intermediate features $I \in \mathbb{R}^{H \times W \times C}$:

$$I = \sum_{i=2}^{5} S_i \otimes R_i,$$

where $\otimes$ refers to channel-wise multiplication. The obtained features $I$ are then rescaled to the same sizes as $\{P_2, P_3, P_4, P_5\}$, respectively, and are denoted as $\{N_2, N_3, N_4, N_5\}$. Then we use $N_i$ to strengthen the original features $P_i$ and get the final output $Z_i$.
As a result, we get the final multi-scale feature maps $\{Z_2, Z_3, Z_4, Z_5\}$. The comparison between the feature maps $\{P_2, \dots, P_5\}$ and $\{Z_2, \dots, Z_5\}$ is shown in Figure 6. From Figure 6, we can see that the feature maps $\{Z_2, \dots, Z_5\}$ contain richer discriminative information than $\{P_2, \dots, P_5\}$, especially at the low levels $\{Z_2, Z_3\}$ and the high level $Z_5$, which supports the effectiveness of our method.
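The fusion step of SAFFM can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `saffm_fuse`, the toy sizes and the random weights are hypothetical, the inputs are assumed to be already rescaled to a common size, and the two 1 × 1 convolutions are written as plain matrices (on a 1 × 1 map they reduce to matrix products). The sketch stops at the fused map; rescaling back to each level and strengthening the original FPN features are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def saffm_fuse(rescaled, w1, w2):
    """Fuse same-size feature maps with channel-focus weights.

    rescaled: list of four arrays (R_2..R_5), each of shape (H, W, C).
    w1: (4C, C/4) and w2: (C/4, 4C), stand-ins for the two 1x1 convolutions.
    Returns the fused map of shape (H, W, C) and the per-scale weights.
    """
    r_c = np.concatenate(rescaled, axis=2)        # concatenation -> (H, W, 4C)
    r_cp = r_c.mean(axis=(0, 1))                  # global average pooling -> (4C,)
    hidden = np.maximum(r_cp @ w1, 0.0)           # first 1x1 conv + ReLU -> (C/4,)
    s = sigmoid(hidden @ w2)                      # channel-focus weights -> (4C,)
    s_split = np.split(s, len(rescaled))          # one (C,) weight vector per scale
    # Channel-wise re-weighting followed by element-wise summation.
    fused = sum(s_i * r_i for s_i, r_i in zip(s_split, rescaled))
    return fused, s_split

# Hypothetical toy sizes; random arrays stand in for learned weights.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
features = [rng.standard_normal((H, W, C)) for _ in range(4)]
w1 = rng.standard_normal((4 * C, C // 4))
w2 = rng.standard_normal((C // 4, 4 * C))
fused, weights = saffm_fuse(features, w1, w2)
print(fused.shape, [w.shape for w in weights])
```

With all-zero weights the gate outputs 0.5 for every channel, so the module falls back to a uniformly scaled sum of the four levels; learned weights let it depart from that uniform sum per channel and per scale.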

Region Proposal Network
The Region Proposal Network (RPN) is used to process the multi-scale feature maps to generate region proposals. Similar to Mask R-CNN, we set the areas of the anchors to $32^2$, $64^2$, $128^2$ and $256^2$ in the multi-scale feature maps $\{Z_2, Z_3, Z_4, Z_5\}$. The aspect ratios of the anchors in each layer are set to {1 : 2, 1 : 1, 2 : 1}. The feature maps with these anchors are sent to the RPN to generate region proposals. After non-maximum suppression (NMS), the region proposals with high scores are kept as the regions of interest (RoIs).
Once the RoIs are generated, they are allocated to different layers of feature maps according to their scale. Large RoIs are allocated to the coarser-resolution feature maps such as $Z_5$; small RoIs are allocated to the finer-resolution feature maps such as $Z_2$. In this work, we use the same allocation strategy as FPN [16]. The allocation strategy can be described as

$$l = \left\lfloor l_0 + \log_2\!\left(\frac{\sqrt{wh}}{224}\right) \right\rfloor,$$

where $l$ is the level of the feature map that should be selected, $l_0$ is a constant set to 4, $w$ and $h$ are the width and height of the RoI, respectively, and 224 is the canonical pre-training image size used in FPN [16].
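The FPN allocation rule can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `roi_to_level` is hypothetical, 224 is the canonical size from the FPN paper, and clamping the result to the available levels 2 to 5 is assumed here rather than stated in the text.

```python
import math

def roi_to_level(w, h, l0=4, canonical=224, l_min=2, l_max=5):
    """Pick the pyramid level for an RoI of width w and height h (in pixels)."""
    level = math.floor(l0 + math.log2(math.sqrt(w * h) / canonical))
    # Assumed: clamp to the levels that actually exist (Z_2 .. Z_5).
    return max(l_min, min(l_max, level))

for size in (32, 112, 224, 448, 1000):
    print(size, roi_to_level(size, size))
```

A 224 × 224 RoI maps to level 4; halving the side length moves it one level finer, and doubling it moves it one level coarser.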

Regional Attention Based Triple Head Network
When a vehicle is occluded by other vehicles, the RoI of the vehicle will contain some features of the vehicles which occlude the target one, which will cause interference when locating the target vehicle. In order to reduce the influence of occlusion on the bounding box regression and segmentation tasks which are location-sensitive, we design the Regional Attention based Triple Head Network (RATH-Net).
As shown in the right of Figure 4, the RoIs first go through a RoIAlign [19] layer, which normalizes them to a fixed size. Then these fixed-size RoIs are sent to the sub-networks of the head network. Distinguished from the head network of Mask R-CNN [19], the proposed RATH-Net has three sub-networks working independently: a classification branch, a regional attention-guided regression branch, and a regional attention-guided mask branch. The RAM and the three branches of RATH-Net are described as follows.

Regional Attention Module
Attention mechanisms play an increasingly important role in the computer vision field. Inspired by the self-attention mechanism [42][43][44], we design a Regional Attention Module (RAM) to spotlight meaningful pixels and suppress the background noise caused by occlusion.
The structure of RAM is shown in Figure 7. Formally, given a feature map $F \in \mathbb{R}^{H \times W \times C}$, the RAM first uses a 1 × 1 convolutional layer to compress the feature map across the channel dimension. Then, a 3 × 3 dilated convolutional layer is applied to utilize contextual information and get the feature maps $F_d \in \mathbb{R}^{H \times W \times \frac{C}{16}}$. The computation process is summarized as

$$F_d = f_d^{3 \times 3}\big(f^{1 \times 1}(F)\big),$$

where $f^{1 \times 1}$ denotes the 1 × 1 convolution and $f_d^{3 \times 3}$ denotes the 3 × 3 dilated convolution.
Average pooling and max pooling operations are used to aggregate spatial information. We apply average pooling and max pooling along the channel axis of $F_d$ to get $F_{avg}, F_{max} \in \mathbb{R}^{H \times W \times 1}$. After that, $F_{avg}$ and $F_{max}$ are aggregated via concatenation, followed by a 3 × 3 convolutional layer and normalization by the sigmoid function. The regional attention map $R_s$ is computed as

$$R_s = \sigma\big(f^{3 \times 3}(\mathrm{cat}(F_{avg}, F_{max}))\big),$$

where $\sigma$ denotes the sigmoid function, $f^{3 \times 3}$ indicates a convolution operation with a filter size of 3 × 3, and cat represents the concatenation operation. Finally, the regional attention guided feature map $F' \in \mathbb{R}^{H \times W \times C}$ is computed as

$$F' = R_s \otimes F,$$

where $\otimes$ denotes element-wise multiplication.
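The data flow of the RAM can be sketched in NumPy as follows. This is an illustration rather than the authors' code: the helper `conv2d`, the function name `regional_attention`, the dilation rate of 2, the choice to reduce the channels to C/16 in the 1 × 1 layer, the absence of intermediate activations and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d(x, w, dilation=1):
    """Stride-1 convolution with zero 'same' padding.

    x: (H, W, C_in); w: (k, k, C_in, C_out) with odd k.
    """
    k = w.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            r0, c0 = i * dilation, j * dilation
            out += xp[r0:r0 + H, c0:c0 + W, :] @ w[i, j]
    return out

def regional_attention(F, w_1x1, w_dil, w_att, dilation=2):
    # 1x1 compression followed by a 3x3 dilated convolution -> (H, W, C/16).
    F_d = conv2d(conv2d(F, w_1x1), w_dil, dilation)
    # Pool along the channel axis -> two (H, W, 1) maps.
    F_avg = F_d.mean(axis=2, keepdims=True)
    F_max = F_d.max(axis=2, keepdims=True)
    # 3x3 convolution over the concatenated maps, then sigmoid -> (H, W, 1).
    R_s = sigmoid(conv2d(np.concatenate([F_avg, F_max], axis=2), w_att))
    # The attention map is broadcast over all channels of F.
    return F * R_s, R_s

# Hypothetical sizes; small random arrays stand in for learned weights.
rng = np.random.default_rng(0)
H, W, C = 7, 7, 32
F = rng.standard_normal((H, W, C))
w_1x1 = 0.1 * rng.standard_normal((1, 1, C, C // 16))
w_dil = 0.1 * rng.standard_normal((3, 3, C // 16, C // 16))
w_att = 0.1 * rng.standard_normal((3, 3, 2, 1))
F_att, R_s = regional_attention(F, w_1x1, w_dil, w_att)
print(F_att.shape, R_s.shape)
```

Because the attention map lies between 0 and 1 and has a single channel, it can only attenuate a spatial location across all channels at once, which matches the stated goal of suppressing regions that belong to occluding vehicles.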

Classification Branch
The classification branch is designed to classify and output probabilities for all object classes and the background class. As shown in Figure 8a, the classification branch mainly consists of two fully connected layers with 1024 dimensions each. Each RoI goes through these two fully connected layers, which output the probabilities $p$. The loss function for the classification head is the cross-entropy loss. For each RoI, the classification loss is defined as

$$L_{cls}(p, a) = -\log p_a,$$

where $p = (p_0, \dots, p_c)$ covers $c + 1$ classes and $a$ is the ground-truth class. $p$ is computed by a softmax over the $c + 1$ outputs of a fully connected layer.
Figure 8. (a) The structure of the classification branch. (b) The structure of the regional attention-guided regression branch. (c) The structure of the regional attention-guided mask branch.

Regional Attention-Guided Regression Branch
The regional attention-guided regression branch is designed to regress the detection regions and outputs parameterized coordinates of bounding boxes with the guide of RAM. The RAM module is described in Section 4.2.1.
As shown in Figure 8b, after RoI features are extracted by RoIAlign [19] with 7 × 7 resolution, those features are fed into four 3 × 3 convolutional layers and the RAM sequentially. Then, a fully connected layer is used to output the predicted tuple. Like other common bounding box regression methods [45], we define the bounding box regression loss $L_{bbox}$ over a tuple of true bounding-box regression targets for class $a$, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^a = (t^a_x, t^a_y, t^a_w, t^a_h)$ for class $a$:

$$L_{bbox}(t^a, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^a_i - v_i),$$

and

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$$

where $(x, y)$ are the parameterized center coordinates, and $w$ and $h$ are the parameterized width and height, respectively.
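The regression loss follows the common smooth L1 form, which can be written in a few lines of plain Python. The function names and the sample tuples are illustrative only.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def bbox_loss(t, v):
    """Sum of smooth L1 terms over the (x, y, w, h) regression targets."""
    return sum(smooth_l1(ti - vi) for ti, vi in zip(t, v))

predicted = (0.5, 0.0, 2.0, 0.0)   # hypothetical predicted tuple t^a
target = (0.0, 0.0, 0.0, 0.0)      # hypothetical regression targets v
print(bbox_loss(predicted, target))  # 0.125 + 0 + 1.5 + 0 = 1.625
```

The linear branch keeps large residuals from dominating the gradient, which is why this form is preferred over a plain squared error for box regression.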

Regional Attention-Guided Mask Branch
The regional attention-guided mask branch is used to predict a 28 × 28 mask for each RoI with the guide of RAM. The RAM module is described in Section 4.2.1.
As we can see in Figure 8c, once the features inside the predicted RoIs are extracted by the RoIAlign layer with a size of 14 × 14, those features are fed into four 3 × 3 convolutional layers and the RAM sequentially. Then, a 2 × 2 deconvolution upsamples the feature maps to 28 × 28 resolution. After a per-pixel sigmoid, the loss function for the mask head is the cross-entropy between the segmentation result and the corresponding ground-truth. For an RoI associated with ground-truth class $a$, $L_{mask}$ is only defined on the $a$-th mask:

$$L_{mask} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[ M^*_{ij} \log M_{ij} + (1 - M^*_{ij}) \log (1 - M_{ij}) \right],$$

where $m \times m$ is the size of the mask, $M^*$ is the binary ground-truth mask and $M$ is the estimated $a$-th mask.
In the end, we define the multi-task loss function on each RoI as

$$L = \alpha L_{cls} + \beta L_{bbox} + \gamma L_{mask},$$

where $\alpha$, $\beta$ and $\gamma$ are weighting parameters that can be adjusted to various training requirements, reflecting the emphasis placed on the object classification task, the bounding box regression task and the mask segmentation task in the current model. After the three branches are processed, the proposed RATH-Net outputs the detection and segmentation results. In Figure 9, we list some comparison results to show the advantage of our proposed RATH-Net. From Figure 9, we can see that the original network tends to focus more on irrelevant vehicles (marked with a red circle). After the attention mechanism is introduced, the network pays more attention to the current foreground vehicle (marked with a green circle) and significantly suppresses the noise generated by the irrelevant vehicle.
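How the three loss terms combine can be sketched in plain Python. The function names and toy inputs are hypothetical; the default weights 1 : 1 : 2 follow the ratio reported in the experimental setup, and the mask probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def cls_loss(p, a):
    """Cross-entropy for one RoI: negative log-probability of the true class a."""
    return -math.log(p[a])

def mask_loss(pred, gt):
    """Average per-pixel binary cross-entropy over an m x m mask."""
    m = len(pred)
    total = 0.0
    for i in range(m):
        for j in range(m):
            p, g = pred[i][j], gt[i][j]
            total += g * math.log(p) + (1 - g) * math.log(1 - p)
    return -total / (m * m)

def total_loss(l_cls, l_bbox, l_mask, alpha=1.0, beta=1.0, gamma=2.0):
    """Weighted sum of the classification, regression and mask terms."""
    return alpha * l_cls + beta * l_bbox + gamma * l_mask

# Toy example: an undecided classifier and an undecided 2 x 2 mask.
l_cls = cls_loss([0.5, 0.5], a=1)
l_mask = mask_loss([[0.5, 0.5], [0.5, 0.5]], [[1, 0], [0, 1]])
print(total_loss(l_cls, 0.0, l_mask))
```

With the mask term weighted twice as heavily as the other two, an equally uncertain mask prediction contributes twice as much to the total as an equally uncertain class prediction.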

Dataset and Evaluation Metrics
The established UVSD dataset is used to evaluate our method. UVSD contains 5874 images and 98,600 vehicles, and is described in detail in Section 3. In our experiments, there are 3564 images for training, 585 images for validation and 1725 images for testing. The original vehicle samples and their marked samples in the UVSD dataset are shown in Figure 2.
The proposed MSOA-Net can perform both detection and segmentation. The evaluation methods for both detection and segmentation are needed. The different definitions of Intersection over Union (IoU) are the main differences between detection evaluation metrics and segmentation evaluation metrics.
IoU for bounding box based vehicle detection is defined as, where, Pred bbox is the predicted bounding box, and GT bbox is the ground-truth bounding box. Different from the IoU definition in vehicle detection, the IoU in vehicle segmentation is performed over masks instead of boxes. The IoU for segmentation is to quantify the overlapping percentage between the ground-truth mask and prediction mask. Mask IoU is defined as, where, Pred mask is the predicted mask, GT mask is the ground-truth mask, Pred mask ∩ GT mask means the number of pixels common between the ground-truth mask and the predicted mask, Pred mask ∪ GT mask means the total number of pixels that present across both masks. Precision and recall are used to evaluate our method. These parameters are calculated according to different IoU definitions of detection or segmentation. Precision is used to measure the prediction results, and recall is used to measure the quality of positive predictions. The precision and recall are defined as, and, where, TP indicates the true positive number, FP indicates the false positive number and FN indicates the false negative number. The average precision (AP) is used to measure the performance of an algorithm. AP is defined as, where, R and P represent recall and precision, respectively, P(R) means the curve made up of P and R. For the vehicle detection and segmentation problem, there is only one category. Hence, AP is equivalent to mean average precision (mAP). Here, AP 50 denotes the AP value when the threshold of IoU is 0.50. In other words, if IoU ≥ 0.50, the corresponding object is considered as a correct detection. According to the evaluation criteria in Microsoft COCO dataset [4], AP is the mean average at thresholds from 0.50 to 0.95 by steps of size 0.05. In this study, we use the same evaluation metrics of Microsoft COCO dataset. AP bbox denotes the AP value performed on the detection task, and AP mask denotes the AP value performed on the segmentation task. 
$AP_S$, $AP_M$ and $AP_L$ are the AP values of vehicles with small sizes (area < $32^2$), medium sizes ($32^2$ < area < $96^2$) and large sizes (area > $96^2$). In this paper, the area of a vehicle is measured as the number of pixels in the labeled segmentation mask.
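The two IoU variants and the size buckets described above can be illustrated with a minimal sketch. The function names, the `(x1, y1, x2, y2)` box layout and the representation of a mask as a set of `(row, col)` pixels are assumptions made for illustration, not the authors' implementation. The text leaves the bucket boundaries (area exactly $32^2$ or $96^2$) unspecified; the sketch assigns them to the larger bucket.

```python
def box_iou(pred, gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred, gt):
    """IoU of two binary masks, each given as a set of (row, col) pixels."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def size_bucket(mask_area):
    """Size class from the pixel count of the labeled mask.

    Assumption: areas exactly equal to 32**2 or 96**2 go to the larger bucket.
    """
    if mask_area < 32 ** 2:
        return "small"
    if mask_area < 96 ** 2:
        return "medium"
    return "large"


# Two 10 x 10 boxes shifted by 5 pixels overlap on half their area.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))   # intersection 50, union 150
print(mask_iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)}))
print(size_bucket(500), size_bucket(5000), size_bucket(20000))
```

A prediction counts as a true positive at threshold 0.50 when the relevant IoU is at least 0.50, so the same matching logic serves both tasks once the IoU function is swapped.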

Experimental Setup
Our method is implemented in Python based on PyTorch [46]. Our platform is configured with an Intel i7-6800K CPU, 32 GB of DDR4 memory and NVIDIA TITAN-Xp graphics cards.
The backbone network architecture used in this study is ResNet-50 [41]. All the original images in the UVSD dataset are rescaled, without changing the aspect ratio, to 1333 × H or W × 800 (where H ≤ 800, W ≤ 1333). If the height or width of the resized image is not a multiple of 32, the image is padded until both sides are multiples of 32. The implementation details are as follows.
(1) Due to the limitation of the GPU memory, the batch size is set to 2. (2) Our network is trained on a single GPU for 24 epochs in total. (3) The initial learning rate is 0.0025 and is multiplied by 0.1 at epochs 16 and 22. (4) We use a weight decay of 0.0001 and a momentum of 0.9. (5) The ratio of the weighted parameters is α : β : γ = 1 : 1 : 2. (6) No data augmentation is performed.
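The preprocessing and learning-rate schedule above can be sketched as follows. This is an illustrative reading of the setup, not the authors' code: the function names are hypothetical, the rounding rule is an assumption, and the sketch assumes that each learning-rate drop takes effect from the listed epoch onward. The 3840 × 2160 input size is only an example.

```python
import math


def resized_shape(width, height, target_w=1333, target_h=800):
    """Rescale without changing the aspect ratio so that the result is
    1333 x H (H <= 800) or W x 800 (W <= 1333)."""
    scale = min(target_w / width, target_h / height)
    # Assumption: sizes are rounded to the nearest integer.
    return round(width * scale), round(height * scale)


def padded_shape(width, height, multiple=32):
    """Pad each side up to the next multiple of 32."""
    return (math.ceil(width / multiple) * multiple,
            math.ceil(height / multiple) * multiple)


def learning_rate(epoch, base_lr=0.0025, milestones=(16, 22), gamma=0.1):
    """Step schedule: multiply the rate by gamma for every milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


w, h = resized_shape(3840, 2160)      # hypothetical UAV frame size
print((w, h), padded_shape(w, h))
print([learning_rate(e) for e in (0, 16, 22)])
```

Padding to a multiple of 32 keeps the feature-map sizes integral across the strides of a ResNet-50-FPN backbone, which is the usual reason for this step.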

Evaluation of Our Segmentation Task
This experiment compares the performance of our method on the segmentation and detection tasks. Its purpose is to test the hypothesis that, compared with detection methods, our method can effectively segment vehicles with arbitrary orientations and reject background.
To measure the performance of both the detection and segmentation tasks using the same criteria, we introduce a hybrid IoU, $IoU^{hybrid}$, defined as

$$IoU^{hybrid} = \frac{Pred \cap GT_{mask}}{Pred \cup GT_{mask}},$$

where $Pred$ is the predicted bounding box or mask, and $GT_{mask}$ is the ground-truth region of the object. Similar to the definitions of $AP^{mask}$ and $AP^{bbox}$ in Section 5.1, $AP^{hybrid}$ is the mAP value based on $IoU^{hybrid}$, which can measure detection performance and segmentation performance at the same time.

Table 1 shows the test results of our method using $IoU^{hybrid}$. The detection task achieves only 24.7% $AP^{hybrid}$, which means that the detected bounding boxes contain a large number of background pixels. The segmentation task achieves a 52.3% improvement in $AP^{hybrid}$ over the detection task. Hence, the detection method cannot accurately extract the regions of vehicles with arbitrary orientations, whereas our segmentation method extracts vehicles more accurately. Figure 10 shows a comparison example of our vehicle segmentation and detection. On average, the ratio of the segmented mask to the detection bounding box is 66.4% in the UVSD dataset.
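A toy example helps to show why a bounding box scores poorly under the hybrid IoU when the vehicle is not axis-aligned. In the sketch below, the pixel-set representation, the helper names and the diagonal "vehicle" are all assumptions chosen for illustration; the numbers it produces are unrelated to the results in Table 1.

```python
def box_to_pixels(x1, y1, x2, y2):
    """Pixels covered by a box with integer bounds [x1, x2) x [y1, y2)."""
    return {(r, c) for r in range(y1, y2) for c in range(x1, x2)}


def hybrid_iou(pred_pixels, gt_mask_pixels):
    """IoU between a prediction (box or mask, as pixels) and the GT mask."""
    union = len(pred_pixels | gt_mask_pixels)
    return len(pred_pixels & gt_mask_pixels) / union if union else 0.0


# Toy ground truth: a diagonal band, standing in for a vehicle at 45 degrees.
gt_mask = {(i, j) for i in range(10) for j in range(10) if abs(i - j) <= 1}

# The tightest axis-aligned box around the band covers the whole 10 x 10 grid.
tight_box = box_to_pixels(0, 0, 10, 10)

print(len(gt_mask))                      # 28 vehicle pixels
print(hybrid_iou(tight_box, gt_mask))    # box: mostly background
print(hybrid_iou(gt_mask, gt_mask))      # perfect mask
```

Even a perfectly tight box reaches only 0.28 here, because 72 of its 100 pixels are background, while a perfect mask reaches 1.0.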

Comparison with State-of-the-Art Methods
We compare the proposed method with state-of-the-art instance segmentation methods to evaluate its segmentation and detection performance. The methods for comparison include YOLACT [23], YOLACT++ [47], EmbedMask [27], PolarMask [25], CenterMask [26], Mask R-CNN [19], Mask Scoring R-CNN (MS R-CNN) [21] and PANet [17]. For a fair comparison, all compared methods use the same ResNet-50-FPN backbone and are trained for 24 epochs on the training set of UVSD without data augmentation. All other parameters are kept at their default settings.
The vehicle segmentation and detection results on the test set of the UVSD dataset are presented in Table 2, with the best result highlighted in bold. Table 2 shows that our method achieves the best results under all measures.
For the segmentation task, our method achieves an $AP^{mask}$ at least 2.3% higher than the other methods. With fixed $IoU^{mask}$ = 0.5 and $IoU^{mask}$ = 0.75, our method achieves an $AP^{mask}_{50}$ and $AP^{mask}_{75}$ at least 1.0% and 2.1% higher than the other methods, respectively. For vehicles at different scales, our method obtains improvements of at least 2.6% $AP^{mask}_S$, 2.4% $AP^{mask}_M$ and 1.3% $AP^{mask}_L$ on small, medium and large vehicles, respectively, compared with the other methods.
For the detection task, our method achieves an $AP^{bbox}$ at least 3.2% higher than the other methods. With fixed $IoU^{bbox}$ = 0.5 and $IoU^{bbox}$ = 0.75, our method achieves an $AP^{bbox}_{50}$ and $AP^{bbox}_{75}$ at least 0.2% and 2.8% higher than the other methods, respectively. For vehicles at different scales, our method obtains improvements of at least 3.0% $AP^{bbox}_S$, 3.5% $AP^{bbox}_M$ and 2.2% $AP^{bbox}_L$ on small, medium and large vehicles, respectively, compared with the other methods.
With fixed $IoU^{mask}$ = 0.75 and $IoU^{bbox}$ = 0.75, different recall and precision values can be obtained by changing the confidence score thresholds for segmentation and detection. The P-R curves of the different methods are plotted in Figure 11a for vehicle segmentation and in Figure 11b for vehicle detection. The curve of our proposed method encloses a larger area than those of the other methods in both the segmentation task and the detection task. At a given recall, the precision of MSOA-Net is higher than that of the other compared methods, and at a given precision, its recall is higher. Hence, it can be concluded that the proposed MSOA-Net performs better than the other compared methods.
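The way a P-R curve arises from sweeping the confidence threshold can be sketched as follows. The input format is an assumption: each detection is a `(confidence, is_true_positive)` pair whose true-positive flag has already been decided by matching against the ground truth at one fixed IoU threshold. The area computation uses the monotone precision envelope, which is one common convention; the COCO evaluation used in this paper samples precision at 101 recall points instead, so its values can differ slightly.

```python
def precision_recall_curve(detections, num_gt):
    """Precision and recall after each detection, in descending confidence.

    detections: list of (confidence, is_true_positive) at a fixed IoU threshold.
    num_gt:     number of ground-truth vehicles.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    return recalls, precisions


def average_precision(recalls, precisions):
    """Area under P(R), using the non-increasing precision envelope."""
    envelope = list(precisions)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, envelope):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Four hypothetical detections against four ground-truth vehicles.
dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
recalls, precisions = precision_recall_curve(dets, num_gt=4)
print(recalls)
print(precisions)
print(average_precision(recalls, precisions))
```

Lowering the threshold admits more detections, so recall never decreases along the curve while precision may rise or fall; a method whose curve lies above another's at every recall value therefore also has the larger area.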
In our proposed method, MSFAF-Net aggregates multi-scale features with self-adaptive weights to enhance FPN, which helps it deal with huge scale changes. RATH-Net uses the regional attention module that we designed to guide the location-sensitive sub-networks to spotlight meaningful pixels and suppress background noise caused by occlusions. These two components explain why our algorithm outperforms general object segmentation methods on the UVSD dataset. In Figure 12, we show some detection and segmentation results of our method on images from different scenarios, which show that our method achieves good performance in scenes with occlusion, night-time conditions, dense traffic and others.

Ablation Study
In this subsection, we evaluate the impact of each component of the proposed MSOA-Net on performance.
We perform an ablation study to identify the contributions of the proposed Multi-Scale Feature Adaptive Fusion Network (MSFAF-Net) and Regional Attention based Triple Head Network (RATH-Net) over the UVSD dataset. Among them, RATH-Net is divided into two parts including Triple-Head Network (TH-Net) and Regional Attention Module (RAM).
All models were trained on the training set of UVSD and evaluated on the testing set of UVSD. We use $AP^{mask}$ and $AP^{bbox}$ as the indicators of model performance. To ensure the fairness and accuracy of the experiment, all parameters that are not part of the improvement under test are kept strictly consistent. In addition, we fix the random seeds in the program to eliminate randomness in the results. Table 3 summarizes the experimental results of the ablation study on the UVSD dataset. A detailed comparison is given in the following.

Table 3. Ablation study of each component in our proposed method on UVSD. MSFAF-Net is the multi-scale feature adaptive fusion network. TH-Net is the triple-head network. RAM is the regional attention module. $AP^{mask}$ and $AP^{bbox}$ denote the mAP of the mask and the mAP of the bounding box, respectively. Subscripts S, M and L refer to small, medium and large vehicles, respectively. The bold numbers represent the best results.

Baseline setup. Mask R-CNN [19] without other components is used as the baseline of the ablation experiments. ResNet-50-FPN is used as the backbone network in all experiments. As Table 3 shows, the baseline achieves 74.3% $AP^{mask}$ and 74.7% $AP^{bbox}$ in our implementation.
Effect of MSFAF-Net. As discussed in Section 4.1, the MSFAF-Net is designed to better handle large scale variations in aerial images. The experimental results in Table 3 show that MSFAF-Net helps FPN achieve improvements of 0.5% $AP^{mask}$ and 0.6% $AP^{bbox}$. Results at small, medium and large scales are consistently improved, and the improvement is most pronounced for small vehicles, at 1.0% $AP^{mask}$ and 1.6% $AP^{bbox}$. To further demonstrate the effectiveness of our method, we compare it with PAFPN [17], which adds a bottom-up pathway to enhance FPN. As shown in Table 4, our method performs better than PAFPN on every indicator except $AP^{mask}_M$. These results indicate that our method helps FPN cope better with scale changes.

Effect of RATH-Net. In this part, we consider the impact of each of the two proposed components of RATH-Net. TH-Net is the multi-task head network that we redesigned, used here without the regional attention module. Table 3 shows that TH-Net gains 1.4% $AP^{mask}$ and 1.8% $AP^{bbox}$. This means that subnets working independently in the head network perform the multiple tasks better than shared convolutional layers. Table 5 summarizes TH-Net performance with different ratios of the weighted parameters. Compared with setting the same weight for each task, increasing the weight of the segmentation task significantly improves the performance of the algorithm. Therefore, the ratio of the weighted parameters α, β and γ of the multi-task loss is set to 1:1:2.
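The effect of the 1:1:2 ratio can be made concrete with a minimal sketch of a weighted multi-task loss. The text above only states that γ weights the segmentation task; pairing α with the classification loss and β with the box-regression loss is an assumption here, and the loss values are placeholders.

```python
def multi_task_loss(loss_cls, loss_bbox, loss_mask,
                    alpha=1.0, beta=1.0, gamma=2.0):
    """Weighted sum of the three head losses.

    Assumption: alpha weights classification and beta weights box regression;
    gamma weights the segmentation (mask) loss.
    """
    return alpha * loss_cls + beta * loss_bbox + gamma * loss_mask


# Placeholder per-task losses for one training step.
equal = multi_task_loss(0.5, 0.3, 0.4, gamma=1.0)   # 1:1:1 ratio
chosen = multi_task_loss(0.5, 0.3, 0.4)             # 1:1:2 ratio
print(equal, chosen)
```

Doubling γ doubles the gradient contribution of the mask loss relative to the other two heads, which is one way to read the gain reported for the 1:1:2 setting.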
The RAM module adopts a self-attention mechanism that guides the head network to focus on the region of interest, which effectively reduces the effect of occlusion. As can be seen from Table 3, RAM yields improvements of 0.8% $AP^{mask}$ and 1.1% $AP^{bbox}$. This indicates that the regional attention module boosts network performance in both the segmentation task and the detection task by suppressing the effects of occlusion.

Table 5. Performance evaluation with different ratios of weighted parameters. $AP^{mask}$ and $AP^{bbox}$ denote the mAP of the mask and the mAP of the bounding box, respectively. The bold numbers represent the best results.

Failure Cases
Although our algorithm achieves state-of-the-art performance on the UVSD dataset, there are still some failure cases.
In Figure 13, we list some typical failure predictions of our method. As shown in Figure 13, our method misses some vehicles and mistakenly treats closely adjacent vehicles as a single vehicle. One possible reason is that vehicles such as trucks and buses rarely appear in the training set of UVSD. In addition, these crowded vehicles have similar appearance features. With more training samples of such cases, detection and segmentation are likely to improve. In future research, we will try different techniques to solve this problem, such as data augmentation [48,49] and online hard sample mining [50]. In addition, all vehicles are annotated as a single category in the UVSD dataset. We will update the dataset to provide more detailed vehicle categories in future work. After that, we can adjust the training strategy or loss functions according to the difference in the number of vehicles of each type [51].

Conclusions
In this research, a new vehicle segmentation method called Multi-Scale and Occlusion Aware Network (MSOA-Net) is proposed to better handle the problems of arbitrary orientations, huge scale variations and occlusion when detecting vehicles in UAV aerial images. Firstly, we design a multi-scale feature adaptive fusion network that adaptively integrates multi-scale features to help FPN better deal with huge scale variations. Secondly, we propose a regional attention based head network to reduce the effects of occlusion in vehicle segmentation and regression tasks. To promote the development of drone-based computer vision, we release the public UVSD dataset for vehicle segmentation and detection, which is the first public instance segmentation dataset for UAV-captured vehicles. Experiments are conducted on the challenging UVSD dataset with different attitudes, different altitudes and occlusions. The results show that the proposed method is efficient in detecting and segmenting vehicles, and outperforms the compared methods. In the future, we aim to further improve the generalization ability of our method and to extend the dataset to cover more scenarios. We hope our proposed algorithm and dataset can inspire more researchers to work on drone-based computer vision tasks.