Oriented Vehicle Detection in Aerial Images Based on YOLOv4

CNN-based object detectors have achieved great success in recent years. Most available detectors adopt horizontal bounding boxes to locate various objects. However, in some unique scenarios, objects such as buildings and vehicles in aerial images may be densely arranged and have apparent orientations. Therefore, some approaches extend the horizontal bounding box to the oriented bounding box to better extract objects, usually by directly regressing the angle or corners. However, this suffers from the discontinuous boundary problem caused by angular periodicity or corner ordering. In this paper, we propose a simple but efficient oriented object detector based on the YOLOv4 architecture. We regress the offset of an object’s front point instead of its angle or corners to avoid the above-mentioned problems. In addition, we introduce an intersection over union (IoU) correction factor to make the training process more stable. The experimental results on two public datasets, DOTA and HRSC2016, demonstrate that the proposed method significantly outperforms other methods in terms of detection speed while maintaining high accuracy. On DOTA, our proposed method achieves the highest mAP for the classes with prominent front-side appearances, such as small vehicles, large vehicles and ships. The highly efficient YOLOv4 architecture increases detection speed by more than 25% compared to the other approaches.


Introduction
With the rise of convolutional neural networks (CNNs) in the past decade, object detection performance has been rapidly enhanced by their rich feature representations. Various detection schemes have been proposed to make good use of these features. They are mainly divided into two categories: two-stage detectors [1][2][3][4] and single-stage detectors [5,6]. Although each has its merits, both use Horizontal Bounding Boxes (HBB) to locate all kinds of objects from arbitrary views. However, for some specific applications, such as vehicle/building detection in remote sensing images and scene text detection, an Oriented Bounding Box (OBB) fits the target more closely than an HBB. Moreover, objects with large aspect ratios or dense arrangements make it challenging for detectors to cover the entire object area while excluding non-object areas with an HBB. Previous research has shown that OBB can help detect the above cases, which usually occur in remote sensing images. Furthermore, the orientation of an object may also provide vital information for further processing.
Recently, a growing number of oriented object detectors, adapted from classic object detectors, have achieved promising results in remote sensing. At the same time, the maturity of cost-effective drones has led to the widespread use of aerial photography in many areas, such as precision agriculture, rescue work and wildlife monitoring. Among these applications, traffic surveying requires accurate vehicle localization via OBB rather than HBB. Since most vehicles can be adequately located by an OBB in the bird's-eye view, the area of each vehicle can be efficiently represented by the corners of the OBB. The estimation of an OBB is usually accomplished by regressing its representation model, which typically contains five to eight parameters. However, many previous works calculated these parameters based on the constrained quadrilateral and the OBB angle definition in OpenCV. Ambiguous corner ordering and angular periodicity cause the discontinuous boundary problem and obstruct the regression of these parameters [7,8]. The boundary discontinuity problem often causes the model's loss value to increase suddenly in boundary cases.
Aiming to solve the problem above, we propose a simple, practical framework for oriented object detection, which can be used in traffic surveys. To avoid the angular discontinuity, we regress the front point offset of the object rather than its angle.
As mentioned above, HBBs have been widely used to indicate the position and area of objects in general object detection tasks. For instance, the classic two-stage object detector Faster RCNN [1], which evolved from RCNN [2] and Fast RCNN [3], filters proposals from the Region Proposal Network (RPN) in the first stage and applies a second stage for position refinement and object classification. The RPN requires pre-defined anchors, also designed in the form of HBB, to generate proposals. R-FCN [4] combined Region of Interest (ROI) pooling with position-sensitive score maps to improve detection speed and alleviate the position insensitivity of CNNs. On the other hand, a one-stage detector, such as YOLO [5], SSD [6] or RetinaNet [9], evaluates images in a single pass to classify and locate objects simultaneously without second-stage calibration. Although one-stage detectors were initially less accurate than two-stage detectors, the single-stage structure allows for higher detection speeds. To bridge the gap in accuracy, functional mechanisms including anchor boxes, the Feature Pyramid Network (FPN) [10] and the Cross Stage Partial Network (CSPNet) [11] were gradually introduced in the YOLO series [12][13][14][15]. The state-of-the-art object detectors mentioned above all use HBB to describe the position of objects of arbitrary shape.
In order to obtain the position and orientation of rectangular objects while eliminating background interference, OBB becomes a better alternative to HBB. Researchers have proposed many fascinating oriented object detectors based on two-stage and one-stage models. Earlier oriented object detectors were based on the two-stage scheme and applied to scene text detection. RRPN [16] directly adds an angle parameter to the RPN of Faster RCNN to make proposals directional. The arbitrarily oriented proposals are then converted to fixed-size feature maps by Rotation Region of Interest (RROI) pooling to detect oriented scene text. In R2CNN [17], Jiang Y. et al. noted that bounding boxes rotated to 90° and −90° have very similar shapes but an enormous angle difference. This discontinuity of the angle may cause difficulties in network learning. Therefore, they adopted the coordinates of the first two corners in clockwise order and the height of the bounding box to represent an inclined rectangle (x1, y1, x2, y2, h). To capture texts with large aspect ratios, they also performed ROI pooling with different pool sizes (7 × 7, 11 × 3, 3 × 11).
To compensate for the lack of angle information in general object detectors, the ROI transformer [18] learns transformation parameters to convert horizontal ROIs into rotated ROIs and then extracts rotation-invariant features for classification and localization. In recent years, more and more oriented object detectors have been proposed and applied to aerial photographs. Due to the large variability of object size/orientation and the complexity of the background in aerial scenes, many oriented object detectors focus on deeper feature extraction and attention mechanisms. R-DFPN [19] proposed a dense feature pyramid network (DFPN) to better integrate features of different scales and increase the usage rate of features. ICN [20] combines multiple modules, such as an image cascade network, a feature pyramid network and a deformable inception network, to achieve satisfactory performance in detecting oriented objects. SCRDet [7], CADet [21] and RADet [22] add different attention modules before the RPN to better handle small objects and cluttered oriented objects. Gliding Vertex [23] guarantees detection accuracy by regressing vertex offsets without changing the network architecture. Due to their lower computational cost and memory requirements, some oriented object detectors were developed based on a one-stage scheme. TextBoxes++ [24] and RRD [25] adopt vertex regression on SSD, while the latter changes the backbone convolution into oriented response convolution. Considering the difficulty of angle training in R-DFPN, SAR [26] converts some of the rotation parameters from an angular regression task to a classification task to avoid boundary and ordering problems. CSL [27] uses the circular smooth label technique to obtain a more robust angular prediction through classification without suffering from the boundary conditions caused by periodic angles.
Unlike a two-stage detector, a single-stage detector does not have an ROI pooling or ROI align phase to reduce the impact of feature misalignment. R3Det [28] adds a refinement stage based on RetinaNet to achieve feature alignment by reconstructing the feature map and uses a deeper network backbone to extract richer features. In addition to approaches that add a refinement stage, RSDet [8] directly regresses the four corner coordinates of the oriented bounding box to avoid inherent regression inconsistencies, and a modulated rotation loss is designed to address the problem of loss discontinuity.
Whether two-stage or one-stage, oriented object detectors mainly determine an oriented bounding box in three ways: the five-parameter regression-based method, the eight-parameter regression-based method and the classification-based method. Five-parameter regression-based methods usually introduce an additional parameter θ to regress the angle of the bounding box. The range of θ is usually limited to 90 degrees [7,19,21,28] or 180 degrees [16]. However, the network is prone to the loss discontinuity problem in boundary cases. In eight-parameter regression-based methods [8,20,23,24,25], the coordinates of the quadrilateral's four corners are regressed; the boundary discontinuity problem still exists due to the ordering of the corner points. Classification-based methods [27,29] turn the angle regression task into a classification task, which truly addresses the essence of the boundary problem. Nevertheless, they lose precision due to angle discretization and make network tuning harder by introducing a new hyper-parameter, the window function. Moreover, almost none of the above methods can directly determine the front side of an object, even though front information is practical in traffic or sailing analyses.
To achieve highly efficient oriented object detection, we adopt YOLOv4 architecture and extend its head part to estimate object orientation. To overcome the boundary discontinuity problem, the front point of the object is introduced. Experimental results demonstrate that our proposed method outperforms the others in terms of vehicle detection, which plays a vital role in traffic analysis.

The Problems of Regression-Based Rotating Object Detector
As described in Section 1, the Boundary Discontinuity Problem (BDP) is a significant obstruction to the estimation of OBB. An OBB can be determined by its center (x, y), width w, height h and angle θ, or simply by its four corners (x1, y1), (x2, y2), (x3, y3) and (x4, y4). For the former five-parameter OBB model, the angle θ is defined as the acute angle between the x-axis and the first side (width) of the OBB that it touches counterclockwise, following the OpenCV definition, and is limited to the range [−π/2, 0), as shown in Figure 1a. In this case, the acute angle could be formed by the x-axis and any side of the OBB, which means the width w and height h can easily be confused, causing the Exchangeability of Edges (EoE) problem. To alleviate the impact of EoE, some studies further define θ as the acute angle between the x-axis and the first long side of the OBB, thus extending the range from [−π/2, 0) to [−π, 0). The length of the long side is then explicitly specified as the width w and that of the short side as the height h, as shown in Figure 1b. However, the value of the angle is still not continuous when the long side of the OBB is almost parallel to the x-axis. This Periodicity of Angle (PoA) may cause significant loss changes once the angle falls at either end of its permissible range. Compared to the five-parameter OBB model, which introduces an additional parameter to indicate the orientation of the object, the eight-parameter OBB model discards the angle information. Although the PoA problem does not occur in the regression of the OBB's four corners, the consistency of the ordering of these four corners remains an issue. In addition, the eight-parameter OBB model can represent not only rectangles but arbitrary quadrilaterals, which may be undesirable when representing artificial buildings and vehicles.
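The size of this jump can be illustrated numerically. The following sketch (our illustration, not code from the paper) encodes two nearly identical rectangles under the OpenCV-style definition with θ ∈ [−90°, 0): rotating a box slightly past the boundary flips the angle to the other end of the range and swaps w and h, so the parameter distance is huge even though the geometry barely changed.

```python
def five_param_distance(box_a, box_b):
    """L1 distance between two (x, y, w, h, theta) OBB encodings."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

# Two encodings of (almost) the same physical rectangle:
# under the OpenCV definition theta is limited to [-90, 0) degrees,
# so a slightly rotated box flips to the other end of the range
# with width and height exchanged (EoE + PoA).
box_a = (50.0, 50.0, 20.0, 10.0, -89.5)   # just before the boundary
box_b = (50.0, 50.0, 10.0, 20.0, -0.5)    # just after: w/h swapped, angle jumps

print(five_param_distance(box_a, box_b))  # 109.0, despite near-identical geometry
```

A regression loss computed on these five parameters therefore spikes at the boundary, which is exactly the BDP behavior described above.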

Front Point Offset Regression for Angle Prediction
To overcome the EoE and PoA problems caused by the five-parameter OBB model and to extend the angle range to [−2π, 0), the parameter θ is replaced by a pair of parameters x_f and y_f, where (x_f, y_f) indicates the midpoint of the front side of objects such as vehicles and buildings. In this case, an OBB is represented by six parameters, as shown in Figure 1c: the width is defined as the length of the front/rear side and the height refers to the length of the other side. Inspired by the YOLO grid cell strategy, the angle prediction is converted into the prediction of the front point offset (b_xf, b_yf) with respect to the grid, as shown in Figure 2. The network outputs (t_x, t_y, t_xf, t_yf) pass through a sigmoid function σ, and (t_w, t_h) are scaled by the predefined anchor (p_w, p_h), to reconstruct the predicted bounding box (b_x, b_y, b_xf, b_yf, b_w, b_h). Here, the front point and the center are assigned to the same grid cell for each OBB to make training converge more easily. Since the angle prediction is translated from the periodic angle space into the distance space domain, the PoA problem no longer exists.
In addition, the assignment of the front point makes the ordering of the corners no longer ambiguous, which fundamentally avoids the EoE problem.
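A minimal decoding step for this six-parameter model might look as follows. The exact activation scheme is our assumption, modeled on the standard YOLO decoding (sigmoid offsets relative to the grid cell for both the center and the front point, exponential anchor scaling for width and height):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_obb(t, grid_xy, anchor_wh, stride=1.0):
    """Decode raw head outputs t = (tx, ty, txf, tyf, tw, th) into
    (bx, by, bxf, byf, bw, bh). Center and front point offsets are
    kept inside the same grid cell via the sigmoid; width and height
    are scaled from the anchor as in YOLO."""
    tx, ty, txf, tyf, tw, th = t
    cx, cy = grid_xy
    pw, ph = anchor_wh
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    bxf = (sigmoid(txf) + cx) * stride   # front point shares the center's grid cell
    byf = (sigmoid(tyf) + cy) * stride
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bxf, byf, bw, bh

# Zero logits place both points at the cell center and take the anchor size.
print(decode_obb((0, 0, 0, 0, 0, 0), grid_xy=(3, 4), anchor_wh=(16, 32)))
# -> (3.5, 4.5, 3.5, 4.5, 16.0, 32.0)
```

Because both the center and the front point are bounded offsets in plain distance space, no periodic quantity is ever regressed.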

Overview of the Network
The proposed network architecture is based on YOLOv4, as illustrated in Figure 3. YOLOv4 is one of the fastest and most accurate object detectors available today. Its architecture consists of three parts: backbone, neck and head. The backbone, CSPDarknet53, optimizes Darknet53 with the Cross Stage Partial Network (CSP) to reduce the reuse of gradient information, thereby decreasing computational cost and increasing learning ability. In the neck, YOLOv4 utilizes Spatial Pyramid Pooling (SPP) to expand the receptive field and combines Feature Pyramid Networks (FPN) with Path Aggregation Networks (PAN) to fuse low-level and high-level features from different scales for better localization. The head is the prediction layer, which contains grid cells at three different scale levels. Each grid cell outputs its predictions as a (C + 5) × A dimensional vector, where C is the number of classes and A is the number of predefined anchors in each prediction layer. For each anchor, C dimensions give the class probabilities and the remaining five encode a horizontal bounding box as x, y, w, h and confidence. We take advantage of the above YOLOv4 architecture and add two additional parameters t_xf and t_yf for each anchor to predict the front point offset. As a result, the proposed method can localize objects by OBB and obtains the following advantages:
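As a quick sanity check on the head dimensions (our illustration): adding t_xf and t_yf grows the per-cell output from (C + 5) × A to (C + 7) × A.

```python
def head_channels(num_classes, num_anchors, oriented=True):
    """Per-grid-cell output size: x, y, w, h, confidence (5 values), plus
    the front point offsets (txf, tyf) for the oriented head, plus C class
    scores, all repeated for each anchor."""
    box_params = 7 if oriented else 5
    return (num_classes + box_params) * num_anchors

# e.g. DOTA has 15 categories; YOLOv4 uses 3 anchors per scale level.
print(head_channels(15, 3, oriented=False))  # 60: plain YOLOv4 head
print(head_channels(15, 3, oriented=True))   # 66: with front point offsets
```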

1. Avoid boundary discontinuity problem
Both the angle regression and corner regression methods produce boundary discontinuity problems that lead to instability in the loss calculation. In contrast, our proposed method regresses the front point offset and avoids the sharp increases in loss caused by boundary discontinuity.

2. Less predefined anchors
A five-parameter oriented object detector not only needs anchors of different sizes and aspect ratios to provide candidates close to the ground truth, but also anchors at multiple angles to help the model converge, which increases the computational cost. Our approach does not regress the angle; we regress the front point offset, maintaining the overall performance of the model without increasing the number of anchors.

3. Explicit front end information
Since our method locates the front point coordinates of the object, we avoid boundary discontinuity and obtain the object's orientation within [−2π, 0). This helps indicate the front end of a stopped vehicle and infer whether the vehicle is in violation, as well as its possible direction of travel. In contrast, the other OBB definitions are more likely to face the discontinuity problem at extreme angle values and provide only limited angle ranges.

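The front-end orientation can be recovered directly from the decoded center and front point; a sketch follows (our illustration, and mapping atan2's output into a 2π-wide negative range is our assumption about how the paper's convention would be realized):

```python
import math

def heading(center, front):
    """Orientation of the vector from the box center to its front point,
    mapped from atan2's (-pi, pi] into the 2*pi-wide range (-2*pi, 0]."""
    dx = front[0] - center[0]
    dy = front[1] - center[1]
    theta = math.atan2(dy, dx)
    return theta - 2 * math.pi if theta > 0 else theta

print(heading((0.0, 0.0), (1.0, 0.0)))  # 0.0: front point along +x
print(heading((0.0, 0.0), (0.0, 1.0)))  # -3*pi/2 after wrapping
```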

Loss Function
The loss function is decomposed into four parts, which are defined as follows, where S² is the number of grid cells in the feature map, A denotes the number of anchors in each grid cell and N_pos is the number of positive samples.
Figure 3. The proposed network architecture. Each grid cell outputs a (C + 7) × A dimensional vector, where C is the number of classes and A is the number of predefined anchors in each prediction layer; each anchor contains seven parameters: t_x, t_y, t_xf, t_yf, t_w, t_h and confidence p_obj. The concatenation operator in the figure denotes feature map concatenation.

The two binary indicator values in the loss terms denote whether a sample is positive or not. λ1 to λ4 are the hyper-parameters that balance the weights of each loss term.
Since the direction of the front point is not considered in the IoU, the IoU can still be large even if the front points of two OBBs lie in completely different directions. To penalize the IoU loss according to the angular difference caused by the front points, a correction based on the vectors formed by the center point and the front point is introduced into L IoU as below: where P c and P f indicate the center point and front point of a predicted box. T c and T f represent the center point and front point of the ground truth box. The cosine value of half of the angle formed by vectors P c P f and T c T f is in the range [0, 1], which ensures that the value of L IoU remains in the range [0, 1].
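Since the exact formula is not reproduced in the extracted text, the sketch below encodes one consistent reading of the correction (an assumption on our part): the IoU term is scaled by cos(θ_d/2), where θ_d ∈ [0, π] is the angle between the predicted and ground-truth center-to-front vectors, which keeps the loss in [0, 1].

```python
import math

def corrected_iou_loss(iou, pc, pf, tc, tf):
    """IoU loss with an angular correction (assumed form: 1 - IoU*cos(a/2)).
    The angle between the predicted center->front vector (pc->pf) and the
    ground-truth one (tc->tf) lies in [0, pi], so cos(angle/2) is in [0, 1]
    and the resulting loss stays in [0, 1]."""
    vx, vy = pf[0] - pc[0], pf[1] - pc[1]
    ux, uy = tf[0] - tc[0], tf[1] - tc[1]
    cos_angle = (vx * ux + vy * uy) / (math.hypot(vx, vy) * math.hypot(ux, uy))
    cos_angle = max(-1.0, min(1.0, cos_angle))   # numerical safety before acos
    angle = math.acos(cos_angle)
    return 1.0 - iou * math.cos(angle / 2.0)

# Same direction: no extra penalty beyond 1 - IoU.
print(corrected_iou_loss(0.8, (0, 0), (1, 0), (5, 5), (6, 5)))   # ~0.2
# Opposite direction: cos(pi/2) ~ 0 cancels the IoU term, raising the loss to ~1.
print(corrected_iou_loss(0.8, (0, 0), (1, 0), (5, 5), (4, 5)))
```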

Datasets and Results
The proposed method was implemented in PyTorch [30] on a Tesla V100 with 32 GB of memory. To verify its performance, we conducted experiments on three datasets that contain OBB annotations: DOTA, HRSC2016 and our own dataset.

DOTA
DOTA [31] is a large-scale aerial image dataset for oriented object detection with 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC). Half of the original images were randomly selected as the training set, 1/6 as the validation set and 1/3 as the testing set. We cropped the original images into 1024 × 1024 patches with a stride of 512, obtaining 29,457 patches for the training process.
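The 1024 × 1024, stride-512 cropping scheme can be sketched as below; clamping the final window to the image border is our assumption about how edge patches are handled:

```python
def patch_origins(img_w, img_h, patch=1024, stride=512):
    """Top-left corners of sliding crop windows; the last window along
    each axis is clamped so it stays inside the image."""
    def starts(size):
        xs = list(range(0, max(size - patch, 0) + 1, stride))
        if xs[-1] + patch < size:      # clamp a final window to the border
            xs.append(size - patch)
        return xs
    return [(x, y) for y in starts(img_h) for x in starts(img_w)]

# A 2048 x 1536 image tiles exactly into 3 x 2 = 6 windows.
print(len(patch_origins(2048, 1536)))  # 6
```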

HRSC2016
HRSC2016 [32] is a high-resolution ship dataset collected from six different harbors on Google Earth. It contains 1061 images with sizes ranging from 300 × 300 to 1500 × 900, and ground truths are labeled with rotated rectangles. We followed the original division to obtain training, validation and test sets with 436, 181 and 444 images, respectively. HRSC2016 is significantly smaller than DOTA in terms of the number of images, the number of categories and the diversity of content: in DOTA-v1.0, tens of thousands of ships are available for training and most images contain dozens to thousands of targets, while HRSC2016 has only about 3000 targets in all.

Our Dataset
To verify the robustness of the proposed method in traffic applications, we used UAVs to capture images of several road intersections from 65 to 100 m above the ground. All images are sized 1920 × 1080. Unlike in remote sensing images, the detection targets (vehicles) are much closer to the camera, making them appear much larger in our images. We randomly selected 723 images for training and validation and 180 for testing. All images are annotated with OBBs in six classes: sedan, truck, bus, tractor, trailer and motorbike. The number of instances in each class is listed in Table 1.

Training Details
CSPDarknet53, the backbone of the architecture, was pretrained on COCO [33] and provided by YOLOv4. Stochastic gradient descent (SGD) was used as the optimizer on all datasets, with weight decay and momentum set to 0.0005 and 0.937, respectively. Training epochs and learning rates were set per dataset: we trained 250 epochs on DOTA and 300 epochs on HRSC2016 and our dataset, with initial learning rates of 0.01, 0.0001 and 0.05, respectively. The mini-batch size was eight for DOTA and HRSC2016 and one for our dataset. Data augmentation, including mosaic, random flip, random color, random rotation and random scaling, was applied to all datasets.

Result on DOTA
Table 2 shows the comparison results on DOTA, obtained from the official DOTA evaluation server. Our proposed method attains a good trade-off between accuracy and speed. With test time augmentation (TTA), it achieves 73.89% mAP at 20.4 fps on average, outperforming all other methods at the same speed level. Moreover, three categories, small vehicle, large vehicle and ship, obtain the highest APs among all methods: 79.37%, 83.34% and 88.65%. This may be because these objects have obvious orientations and their bounding boxes are more rectangular. Among the methods that surpass ours in the table, Gliding Vertex [23], Mask OBB [34], FFA [35], APE [36] and CenterMap OBB [37] are based on slower two-stage detectors, while RSDet [8], GCL [29] and CSL [27] are single-stage methods using a deeper backbone (ResNet152). In terms of speed, our method is 5 fps faster than SAR [26], the fastest method listed in Table 2. The first reason is that our approach is based on the YOLOv4 architecture, which is designed for high efficiency.
Second, some oriented object detectors propose many anchors with different sizes and angles for better detection results. This significantly increases the number of detection candidates and the computation time when non-maximum suppression is applied. Our proposed six-parameter OBB model maintains accuracy while keeping the number of anchors minimal.
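To make the cost argument concrete, here is a minimal greedy non-maximum suppression sketch (our illustration; the paper does not give its post-processing code). The inner filtering is quadratic in the number of surviving candidates, so anchor sets enlarged with extra angle hypotheses inflate the runtime directly:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes that overlap it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate of box 0 is suppressed
```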

Result on HRSC2016
HRSC2016 contains a large number of ships with arbitrary rotations and large aspect ratios, which are a great challenge to locate accurately. The comparison results are listed in Table 3. With TTA, the mAP of our proposed method reaches 93.7% under the VOC2012 evaluation protocol. It is worth noting that the inference speed of most methods is around 10 fps, and that of the fastest is 15.53 fps for 800 × 800 images. In comparison, our method reaches an inference speed of 19.23 fps, and even 52.63 fps without TTA, which is more than five times faster than most SOTA methods.

Result on Our Dataset
Our dataset evaluation results and confusion matrix are shown in Table 1 and Figure 4. Vehicles are roughly divided into six categories: sedan, truck, bus, tractor, trailer and motorbike. Most of the large vehicles have an area of more than 4000 pixels, and a trailer may exceed 17,000 pixels; on the other hand, the area of a motorbike is less than 800 pixels. The APs of the tractor and trailer only reach 87.9% and 80.3% due to the lack of corresponding training samples. Nevertheless, the other large-vehicle classes achieve over 93%, and sedans even reach 98.1%. Among the small vehicles, the AP of the motorbike is 80%. This low score has two main causes. The first is the small appearance of motorbikes in aerial images; crowded and irregular groups of motorcycles further increase confusion in the feature map, so a motorbike can easily be mis-detected in a waiting zone or under the shade of trees. The second is that many incomplete motorbikes in the training images were mislabeled, which caused a high false positive rate, as shown in Figure 4.

Discussion
The experimental results show that our method achieves a good trade-off between accuracy and speed, especially when the object has an explicit front-side appearance. However, the proposed method has weaknesses in the following cases: (1) when an object does not have a visible front side, determining its front side can be ambiguous, as shown in Figure 8; (2) it is still difficult to precisely locate objects with very large aspect ratios, as shown in Figure 9. In the future, we will try to add a lightweight attention module to locate the front point or use deformable convolution to detect objects with large aspect ratios.

Figure 8. The ambiguity of determining the front side of a tennis court. Since either pair of opposite sides of a tennis court is symmetrical, it is hard to define its front side clearly. The left detection result treats the longer side as the front side while the right one treats the shorter side as the front side. This makes the detection results inconsistent and leads to imprecise OBB positioning.


Figure 9. The OBB and front point of an object with a very large aspect ratio are difficult to precisely locate.


Conclusions
In this paper, we proposed an arbitrarily oriented object detector based on the YOLOv4 framework. We introduced the front point offset of an OBB to avoid the discontinuous boundary problem caused by the regression of angles or corners. The proposed method is especially suitable for vehicle localization in aerial images: it not only generates an OBB that fits the vehicle but also indicates its front end, which plays a vital role in traffic analysis. Validated on the public datasets DOTA and HRSC2016, our detector reaches promising detection accuracy and outstanding inference speed. The inference speed outperforms all the other methods and accomplishes real-time detection. Since our approach only changes the structure of the head, it can readily benefit in accuracy and speed from future upgrades of the YOLO series.