EFN: Field-based Object Detection for Aerial Images

Abstract: In this paper, we propose a field-based network for object detection: the Ellipse Field Network (EFN). It is an elegant way to detect objects that are cluttered and rotated. EFN works with probability fields, which preserve the information of the object distribution in image space during forward propagation. It is designed for object detection in aerial images, and it also works well on natural images. Extensive experiments validate that EFN can work with a lightweight model without sacrificing performance. We achieve state-of-the-art results on aerial image benchmarks and a good score on natural images.

We call this method the Ellipse Field Network (EFN). In addition, we designed a special post-processing step for the object fields: Ellipse Region Fitting (ERF), which combines the Center Field and the Edge Field to obtain the final set of elliptical object regions. To sum up, our work has several advantages:

• Field-based, preserving location information.

• A robust post-processing step makes the results more reliable.

• The framework is one-stage and produces rotated boxes directly.

Fig. 2 illustrates the EFN architecture. EFN usually takes 418×418 images as input, so high-resolution inputs need to be cropped. The network first processes the input image with several convolutional, max-pooling and concatenation layers, and then branches into two sibling output layers: one is the Object Field and the other is the Edge Field. Each output layer has one channel per category. After that, the ERF algorithm processes the output to obtain the center points and edge points of each object and finally figures out the ellipse. As in (1), we use five parameters, x_0, y_0, a, b and θ, to define an ellipse, where (x_0, y_0) are the coordinates of the center point of the ellipse, a and b are the semi-major and semi-minor axes, and θ is the rotation angle. The function F(x, y; x_0, y_0, a, b, θ) describes the relationship between a point and an ellipse.

F(x, y; x_0, y_0, a, b, θ) = [cos θ · (x − x_0) + sin θ · (y − y_0)]² / a² + [− sin θ · (x − x_0) + cos θ · (y − y_0)]² / b² − 1    (1)

Based on that, we define the Center Field intensity G_{c,p} of pixel (x, y) as in (2), where α is a coefficient we call the center field decay index; it determines the decay rate, and we set its default value to 2.5. When objects are densely packed, some points may lie inside more than one ellipse; in that case we choose the ellipse with the minimum distance to the point. The intensity decays at a certain rate from 1 at the center of an object to e^{−α} at its edge. In areas containing no objects, the intensity is 0. Fig. 3(a) shows the intensity distribution.
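Equation (1) and the Center Field can be sketched in NumPy as follows. The implicit function follows (1) directly; since the exact decay profile of (2) is not reproduced here, the sketch assumes the natural form exp(−α·r) with r the normalized elliptical radius (0 at the center, 1 on the edge), which matches the stated boundary values 1 and e^{−α}:

```python
import numpy as np

def ellipse_F(x, y, x0, y0, a, b, theta):
    """Implicit ellipse function of Eq. (1): F = -1 at the center,
    F = 0 on the edge, F > 0 outside."""
    u = np.cos(theta) * (x - x0) + np.sin(theta) * (y - y0)
    v = -np.sin(theta) * (x - x0) + np.cos(theta) * (y - y0)
    return u**2 / a**2 + v**2 / b**2 - 1.0

def center_field(x, y, x0, y0, a, b, theta, alpha=2.5):
    """Assumed decay form for Eq. (2): intensity 1 at the center,
    e^{-alpha} at the edge, 0 outside. The radius sqrt(F + 1) is an
    assumption; the paper only fixes the boundary values."""
    F = ellipse_F(x, y, x0, y0, a, b, theta)
    r = np.sqrt(np.clip(F + 1.0, 0.0, None))  # 0 at center, 1 at edge
    return np.where(F <= 0.0, np.exp(-alpha * r), 0.0)
```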

Similarly, the Edge Field represents the distribution of edge intensity, which describes the distance between pixels and the edges of objects. According to (1), a necessary and sufficient condition for a pixel to lie on an edge is F = 0. Based on that, we define the Edge Field intensity G_{e,p} of pixel (x, y) as in (3). Theoretically, the edge of an object is an elliptic boundary formed by a sequence of connected pixels. In other words, the edge is very thin, which makes it difficult to detect. To reduce the impact of that, we define a parameter ω, called the edge width, to adjust the width of the edges; we set its default value to 0.1. The visualization of the Edge Field is shown in Fig. 3.
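A minimal sketch of an Edge Field with width ω, assuming intensity 1 exactly on the boundary (F = 0) falling linearly to 0 once |F| ≥ ω; the linear profile is a hypothetical stand-in, since the paper defines ω but does not reproduce the exact form of (3) here:

```python
import numpy as np

def ellipse_F(x, y, x0, y0, a, b, theta):
    """Implicit ellipse function of Eq. (1)."""
    u = np.cos(theta) * (x - x0) + np.sin(theta) * (y - y0)
    v = -np.sin(theta) * (x - x0) + np.cos(theta) * (y - y0)
    return u**2 / a**2 + v**2 / b**2 - 1.0

def edge_field(x, y, x0, y0, a, b, theta, omega=0.1):
    """Hypothetical stand-in for Eq. (3): intensity peaks at 1 on the
    edge (F = 0) and decays linearly to 0 where |F| >= omega."""
    F = ellipse_F(x, y, x0, y0, a, b, theta)
    return np.clip(1.0 - np.abs(F) / omega, 0.0, 1.0)
```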

Because small objects are easily submerged, we give them more weight. During training, we first get the area A_obj(p) of the rectangle to which pixel p belongs, and then set the weight according to the reciprocal of the area, as in (4), where a small bias (set to 0.1 by default) keeps the weight bounded. Finally, the output layer scans the whole output array to determine the distance between each pixel and each ellipse, and then computes the loss as in (5), where p indexes the pixels of the image, v_{c,p} and v_{e,p} are the center and edge intensities predicted by the network, and G_{c,p} and G_{e,p} are the ground truth. In this way, the impact of object size is reduced to some extent.
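The area weighting and loss described above can be sketched as follows. The squared-error penalty is an assumption, since the exact per-pixel form of Eq. (5) is not reproduced here; the reciprocal-of-area weight with a 0.1 bias follows Eq. (4):

```python
import numpy as np

def pixel_weights(area_map, bias=0.1):
    """Eq. (4): weight each pixel by the reciprocal of the area of the
    box it belongs to; `bias` keeps the weight finite for tiny areas."""
    return 1.0 / (area_map + bias)

def field_loss(v_c, v_e, G_c, G_e, area_map, bias=0.1):
    """Assumed form of Eq. (5): area-weighted squared error summed over
    both fields (prediction v vs. ground truth G)."""
    w = pixel_weights(area_map, bias)
    return float(np.sum(w * ((v_c - G_c) ** 2 + (v_e - G_e) ** 2)))
```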
where f_i(x, y) is the field intensity and e_i(x, y) is the shortest distance from the point at (x, y) of the large image, mapped into the i-th cropped image, to that crop's border.

The first step is to acquire the initial coordinates of the center points according to the Center Field, which represents the object intensity distribution. For each channel, we scan the elements sequentially.

As shown in Fig. 5(a), if the intensity v_{c,p} ≥ e^{−α}, we search for the maximum intensity among the eight pixels around the current one. If that maximum is greater than v_{c,p}, we continue searching from the new pixel until no neighboring pixel is greater, and record its coordinates (x_c, y_c). After processing a channel, we obtain a group of local-maximum pixel coordinates, which are the initial center points of the objects of that category.
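The hill-climbing search for the first ERF step can be sketched as follows; this is a straightforward reading of the description above, not the authors' exact implementation:

```python
import numpy as np

def find_centers(cf, alpha=2.5):
    """From every pixel with intensity >= e^{-alpha}, climb to the
    local maximum of its 8-neighbourhood; distinct maxima are the
    initial center points, returned as (x, y) coordinates."""
    h, w = cf.shape
    thresh = np.exp(-alpha)
    centers = set()
    for y in range(h):
        for x in range(w):
            if cf[y, x] < thresh:
                continue
            cy, cx = y, x
            while True:
                best, by, bx = cf[cy, cx], cy, cx
                for dy in (-1, 0, 1):       # scan the 8-neighbourhood
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and cf[ny, nx] > best:
                            best, by, bx = cf[ny, nx], ny, nx
                if (by, bx) == (cy, cx):    # no brighter neighbour: a maximum
                    break
                cy, cx = by, bx
            centers.add((cx, cy))
    return sorted(centers)
```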

The second step is to acquire the points on the edges; the Edge Field represents the edge intensity distribution. Starting from the initial center points acquired in the first step, we cast a ray every three degrees from 0° to 360°. As shown in Fig. 5(b), along the rays, v_{e,p} jumps somewhere while v_{c,p} decays from the center point. We start from a center point and scan pixels along each ray; if the v_{c,p} of a pixel is less than e^{−α}, or its v_{e,p} is more than 0.4 greater than that of the prior pixel, we take it as a point on the edge. After processing all rays, we obtain 120 points on the edge and record them with parametric equations, as in (7).
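The ray scan above can be sketched as follows. The stopping rule (center field below e^{−α}, or an edge-field jump above 0.4) is taken from the text; the range cap `r_max` is an implementation convenience, not a parameter from the paper:

```python
import numpy as np

def edge_points(cf, ef, cx, cy, alpha=2.5, jump=0.4, n_rays=120, r_max=200):
    """Walk 120 rays (one every 3 degrees) outward from center (cx, cy).
    A ray stops at the first pixel whose center-field value drops below
    e^{-alpha} or whose edge-field value jumps by more than `jump`
    versus the previous pixel; the stop is recorded in the parametric
    form (r, theta) about the center, in the spirit of Eq. (7)."""
    h, w = cf.shape
    thresh = np.exp(-alpha)
    pts = []
    for k in range(n_rays):
        t = 2.0 * np.pi * k / n_rays
        dx, dy = np.cos(t), np.sin(t)
        prev_e = ef[cy, cx]
        for r in range(1, r_max):
            x, y = int(round(cx + r * dx)), int(round(cy + r * dy))
            if not (0 <= x < w and 0 <= y < h):
                break
            if cf[y, x] < thresh or ef[y, x] - prev_e > jump:
                pts.append((r, t))
                break
            prev_e = ef[y, x]
    return pts
```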

The final step is to figure out the parameters. The elliptic equation is nonlinear, contains five parameters, as in (1), and requires at least five points to solve. From the former steps we have 120 edge points for each object. Accordingly, we pick five points and employ the Levenberg-Marquardt (LM) method [25] to figure out the parameters. Since the initial value of the central point is selected as a local maximum of the Center Field, its deviation will not be large, so we add a center-constraint condition λ(x_0² + y_0²) = 0, with the coordinates taken relative to the initial center estimate (λ is a coefficient; we use λ = 2000). Taking partial derivatives of the constraint condition and the elliptic function with respect to each variable yields the Jacobian matrix. Generally, an equation established from an edge point with higher intensity is more reliable and should be given greater weight, so we use the Edge Field value v_{e,p} of each edge point to compose a diagonal weight matrix, as in (8). Based on the above, we get the formula in (9), which calculates the correction of the five parameters; iterating the correction until it falls below a certain threshold gives reliable results.
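The LM fit can be sketched as follows, using the implicit residual F of Eq. (1) and a numerical Jacobian. The paper additionally adds the center-constraint row of λ(x_0² + y_0²) = 0 and uses the edge-field values as the diagonal weight matrix of Eq. (8); here a plain per-point weight vector and a fixed damping term stand in for those details, as a sketch rather than the authors' implementation:

```python
import numpy as np

def fit_ellipse_lm(pts, p0, weights=None, iters=50, lam=1e-3):
    """Damped Gauss-Newton (LM-style) fit of (x0, y0, a, b, theta) to
    edge points `pts` (N x 2), minimizing the implicit residual F of
    Eq. (1). Jacobian by forward differences; fixed damping `lam`."""
    pts = np.asarray(pts, float)

    def residuals(p):
        x0, y0, a, b, th = p
        u = np.cos(th) * (pts[:, 0] - x0) + np.sin(th) * (pts[:, 1] - y0)
        v = -np.sin(th) * (pts[:, 0] - x0) + np.cos(th) * (pts[:, 1] - y0)
        return u**2 / a**2 + v**2 / b**2 - 1.0

    p = np.asarray(p0, float)
    w = np.ones(len(pts)) if weights is None else np.asarray(weights, float)
    for _ in range(iters):
        r = residuals(p)
        J = np.empty((len(pts), 5))
        for j in range(5):                       # forward-difference Jacobian
            dp = np.zeros(5)
            dp[j] = 1e-6
            J[:, j] = (residuals(p + dp) - r) / 1e-6
        A = J.T @ (w[:, None] * J) + lam * np.eye(5)
        step = np.linalg.solve(A, -J.T @ (w * r))
        p = p + step
        if np.linalg.norm(step) < 1e-10:
            break
    return p
```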
To improve fault tolerance, we use Algorithm 1 to optimize. In a specific dataset, the semi-axes a and b lie within a certain range (for DOTA, (0.001, 0.7)); results beyond that range are false positives and are eliminated. The core of the sampling loop in Algorithm 1 is:

6: n = the number of errors E_i that are less than ξ;
7: if max_inlier < n then max_inlier = n;
8: p = [1 − (max_inlier/N)^5]^t, where t is the iteration count;
9: count = count + 1;
10: until p < 0.0001;
11: for all inliers, solve (a, b, x_0, y_0, θ);

EFN surpasses all the state-of-the-art methods, which proves that our method is more suitable for oriented object detection in aerial images. We argue that there are two reasons: 1) traditional frameworks first generate proposal boxes and then analyze them one by one to discriminate whether each box is correct, which easily leads to wrong discriminations; EFN predicts fields, where the intensity of object regions is high and that of non-object regions is low, which is more similar to how the human visual system works. 2) Traditional frameworks regress bounding boxes, but objects usually occupy only small parts of an image, which makes these models more nonuniform; compared to them, EFN better models the distribution of objects in aerial images by regressing fields.
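The adaptive stopping rule of Algorithm 1 above follows the standard sample-consensus pattern: draw 5-point samples, count inliers, and stop once the failure probability (1 − w^5)^t drops below 10^{-4}, where w is the best inlier ratio so far. A generic sketch, in which `solve5` and `error` are caller-supplied placeholders (e.g. the LM fit and the |F| residual), not functions from the paper:

```python
import random

def ransac_ellipse(points, solve5, error, xi=0.1, p_stop=1e-4, max_iter=1000):
    """Adaptive sample consensus in the spirit of Algorithm 1:
    sample 5 points, fit, count inliers with error below xi, and stop
    when (1 - w^5)^t < p_stop; finally re-solve on all inliers."""
    n_pts = len(points)
    best_params, max_inlier, t = None, 0, 0
    while t < max_iter:
        sample = random.sample(points, 5)
        params = solve5(sample)
        inliers = [q for q in points if error(q, params) < xi]
        if len(inliers) > max_inlier:
            max_inlier, best_params = len(inliers), params
        t += 1
        w = max_inlier / n_pts
        if (1.0 - w ** 5) ** t < p_stop:
            break
    if best_params is None:
        return None, []
    final_inliers = [q for q in points if error(q, best_params) < xi]
    return solve5(final_inliers), final_inliers
```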

Although our method was originally designed for aerial image object detection, experiments show that it also works well on conventional images and can be used in a variety of scenes with great potential. We also compare memory usage, as shown in Table 3. Traditional frameworks use deep backbones such as ResNet [31] to extract features and rely on pre-trained models; as a consequence, such models are memory-consuming. EFN uses U-Net as its backbone, which does not rely on a pre-trained model and is relatively shallow, so the model is much more memory-efficient.

Training with a small batch size in the preliminary phase and then with a larger batch size later is an effective strategy: the small batch size makes for faster convergence, while the larger batch size makes for fine optimization.

There are two important parameters in the training phase: the center field decay index α and the edge width ω. Both have significant impacts on performance. To find the best values, we train models with a set of values; Table 6 compares models trained with different α and ω. Both parameters should be set to appropriate values. If α is too low, the CF barely decays from center to edge, which may cause ERF to find the wrong object center; if α is too high, the CF decays rapidly, making it difficult to detect object centers. A low value of ω makes the EF inconspicuous; in this case, the detection of edge points becomes inaccurate and some points may be left out. A high value of ω may cause the edges of adjacent objects to overlap, which harms their separation.

Unlike aerial images, the images of VOC are taken from a low-altitude perspective at low resolution, so we no longer crop the images or predict the angle of objects. In other words, we predict only four parameters and predefine θ = 0, as shown in (10). Table 7 shows the results: we achieve 84.7% mAP. The visualization of detection results is shown in Fig. 9. Compared to state-of-the-art methods that specialize in this application scenario, EFN still obtains respectable scores.

In this paper, we proposed a novel method for detecting objects in aerial images. Unlike typical region-proposal methods, we introduced the concept of fields into networks: we remold a common network to calculate the Center Field and Edge Field, and use the robust Ellipse Region Fitting algorithm to identify objects precisely. Only the first step depends on the training data; the second step can be applied to any target, which greatly reduces the difficulty of network training. Experiments demonstrate the effectiveness of the method and an effective way to train it. In the future, we will do more research on EFN and continuously improve our method.

The following abbreviations are used in this manuscript:

EFN  Ellipse Field Network
ERF  Ellipse Region Fitting
CF   Center Field
EF   Edge Field
LM   Levenberg-Marquardt