LRP-DS: Lightweight RepPoints with Decoupled Sampling Point Set

Most object detection methods use rectangular bounding boxes to represent the object, while the representative points network (RepPoints) employs a point set to describe the object. RepPoints can provide more fine-grained localization and facilitates classification. However, it ignores the difference between the localization and classification tasks. Therefore, a lightweight RepPoints with decoupling of the sampling point set (LRP-DS) is proposed in this paper. Firstly, the lightweight MobileNetV2 and Feature Pyramid Networks (FPN) are employed as the backbone network, rather than ResNet, to make the network lightweight. Secondly, considering the difference between the classification and localization tasks, the sampling points of classification and localization are decoupled by introducing a classification free sampling method. Finally, because the classification free sampling method aggravates the mismatch between localization accuracy and classification confidence, a localization score is employed to describe the localization accuracy independently. The final network achieves 73.3% mean average precision (mAP) on the VOC07 test dataset, which is 1.9% higher than the original RepPoints with the same MobileNetV2 and FPN backbone. LRP-DS has a detection speed of 20 FPS for an input image of (1000, 600) on an RTX2060 GPU, which is nearly twice as fast as with the ResNet50 and FPN backbone. Experimental results show the effectiveness of our method.


Introduction
Object detection is one of the most widely used tasks in computer vision. Current object detection methods can be roughly divided into two categories according to whether prior anchors are needed: anchor-based object detection and anchor-free object detection. Anchor-based object detection applies a large number of prior anchors to fit the bounding box of a real object; examples are the well-known single-stage methods YOLO [1] and SSD [2], and the two-stage algorithms represented by Faster R-CNN [3]. The prior anchors are usually set manually according to the statistical characteristics of the dataset. In essence, an anchor provides a reference for the network, so that the network obtains some prior knowledge. Therefore, during training, it is easy for the network to learn the mapping between the anchor and the ground truth. At inference, the predicted object box is obtained from the prior anchor and the offset calculated by the network. By contrast, anchor-free object detection, which has attracted a lot of attention recently, does not rely on prior anchors; examples include CornerNet [4], ExtremeNet [5], and the representative points network (RepPoints) [6].
RepPoints treats the prediction of an object's bounding box as the regression of a point set. The innovation of RepPoints is to use a set of representative points to describe the object. When this flexible representation is applied to object detection, the point set can adapt to the geometric changes of the object and provide guidance for feature extraction. However, RepPoints ignores the difference between the object classification and localization tasks. Compared with the geometric boundary information of the object, the classification task pays more attention to the semantic key points of the object. For example, when recognizing a cat, the localization task is best served by features extracted at the boundary of the cat, while the classification task prefers features extracted at semantic key points such as the cat's eyes and nose. In the original RepPoints, the classification and localization tasks share the same sampling positions, which is not appropriate. Therefore, this paper proposes to decouple the feature sampling positions of object classification and localization, so as to give the sampling positions in the classification branch a certain degree of freedom. This allows the feature sampling points of classification to actively find the semantic key positions of the object and thus improve the recognition accuracy of the classification task.
In the common post-processing method of non-maximum suppression (NMS), the bounding box with the highest classification confidence within a category is selected, and the detection results whose intersection over union (IoU) with it exceeds a certain threshold are filtered out. These steps are repeated on the remaining boxes. Therefore, in traditional object detection methods, the classification confidence is implicitly expected to describe both the classification probability and the localization accuracy. However, many studies (such as IoU-Net [7]) show that the classification confidence is not enough to describe the localization accuracy of the bounding box. This is regarded as a mismatch between classification confidence and localization accuracy, caused by the double responsibility of the classification confidence. In RepPoints, this mismatch becomes more serious after the feature sampling points of classification and localization are decoupled. Therefore, the classification confidence is used to describe the category probability of the bounding box, and a localization score is introduced to describe the localization accuracy of the bounding box. In addition, RepPoints uses ResNet50 [8] and Feature Pyramid Networks (FPN) [9] as the backbone network by default; to keep the model lightweight, the lightweight MobileNetV2 [10] and FPN are used as the backbone network instead.
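The NMS procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names `iou` and `nms`, the (x1, y1, x2, y2) box layout, and the 0.5 threshold are assumptions made for the example.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for a single category; returns the indices of the kept boxes."""
    order = np.argsort(-scores)  # highest ranking score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the selected one too much.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Hypothetical detections: the first two boxes overlap heavily, the third is separate.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Because the ranking depends only on `scores`, a box with high classification confidence but poor localization can suppress a better-localized neighbor, which is the mismatch discussed in this paragraph.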
Finally, the lightweight RepPoints with decoupling of the sampling point set (LRP-DS) is proposed. The main contributions of this paper are as follows: (1) The lightweight MobileNetV2 and FPN are applied as the backbone network to make the network lightweight; (2) A classification free sampling method is proposed, in which the sampling sets of classification and localization are decoupled because of the difference between the two tasks; (3) A localization score is employed to describe the localization accuracy independently, in order to address the mismatch between localization accuracy and category probability, which becomes more serious after the classification free sampling method is introduced.

Anchor-Based Object Detection
The essence of the anchor-based detection method is to employ a large number of discrete predefined anchors to cover the possible areas of the object, and let the most suitable anchor be responsible for detecting the corresponding object. In 2014, Girshick et al. [11] proposed the first two-stage object detection method, R-CNN. After that, Fast R-CNN [12] and Faster R-CNN [3] were proposed in succession. The core of this kind of algorithm is to divide the detection problem into two stages: candidate regions are generated by the region proposal network (RPN), and then the candidate regions are classified and their localization is refined. To address the mismatch between the classification score and the localization result generated in the second stage, Cai et al. [13] proposed the Cascade R-CNN network. Cascade R-CNN cascades the R-CNN sub-networks of Faster R-CNN several times. During training, the IoU threshold for selecting positive samples is raised stage by stage, so that high-quality detectors can be trained without reducing the number of positive samples. Later, Pang et al. [14] pointed out three imbalances in the training process of Faster R-CNN, and proposed Libra R-CNN.
Unlike the two-stage method, the single-stage object detection method does not rely on the candidate boxes output by the RPN stage, but detects the object directly. Typical representatives of single-stage object detection methods are YOLO [1,15,16], SSD [2] and RetinaNet [17]. SSD detects objects of different scales on six feature maps of different scales. Furthermore, in order to combine the advantages of features at different scales, Lin et al. [9] proposed FPN, which uses a multi-scale pyramid structure to construct a feature pyramid. In addition, Chen et al. [18] argue that anchors and feature maps are mismatched in single-stage detection algorithms, and that two-stage object detection algorithms alleviate this problem through RoI pooling. Therefore, Chen et al. proposed a special two-stage detection method, AlignDet, which replaces RoI pooling with RoI conv.

Anchor-Free Object Detection
In the field of object detection, although anchor-based methods are the mainstream, anchor-free detectors show strong vitality due to their efficient performance and anchor-free characteristics. Therefore, anchor-free methods have gradually become a research hotspot. The anchor-free detector can be traced back to YOLO-v1 [1], which outputs the bounding box directly and does not depend on prior anchor boxes. Yu et al. [19] proposed UnitBox, which uses the current pixel position and its distances to the upper left corner and the lower right corner to form the bounding box of the object. Recent efforts have pushed anchor-free detectors to outperform their anchor-based counterparts. Some works, such as CornerNet [4], ExtremeNet [5] and CenterNet [20], reformulate the detection problem as locating several key points of the bounding box. To build a bounding box, CornerNet [4] estimates the two corners of the bounding box. ExtremeNet [5] represents the bounding box by calculating four extreme points (top, left, bottom and right) and a center point from heatmaps. CenterNet [20] uses the center point, center point offset, width and height to describe the bounding box. Others, such as FSAF [21], Guided Anchoring [22] and FCOS [23], encode and decode the bounding boxes as anchor points and point-to-boundary distances. The FSAF [21] network adopts a feature selective anchor-free (FSAF) module to perform online feature selection in a feature pyramid. Guided Anchoring [22] can generate anchors automatically at inference. The main contribution of FCOS [23] is a new center-ness branch that reduces the weight of low-quality detection results. In addition, some methods explore new anchor-free forms, such as RepPoints [6]. RepPoints transforms the problem of object detection into a regression problem of a point set, and the bounding box is represented by a flexible point set. These representative point sets can describe the precise geometric and semantic features of the object.
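The idea of encoding a bounding box as an anchor point plus point-to-boundary distances can be illustrated with a short sketch. The helper names `encode` and `decode` and the (left, top, right, bottom) ordering are assumptions chosen for the example rather than the exact conventions of any of the cited detectors.

```python
def encode(point, box):
    """Express a box (x1, y1, x2, y2) as distances from a point inside it."""
    px, py = point
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)  # left, top, right, bottom

def decode(point, distances):
    """Recover the box from the point and its four boundary distances."""
    px, py = point
    left, top, right, bottom = distances
    return (px - left, py - top, px + right, py + bottom)

# Hypothetical example: a point inside a ground-truth box.
point = (30, 40)
box = (10, 20, 70, 90)
distances = encode(point, box)
recovered = decode(point, distances)
```

During training the network regresses the four distances at each location; at inference the decoding step turns them back into a box.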

The Mismatch between Classification and Localization Tasks
The source of the mismatch between the classification and localization tasks is the difference between the two tasks. The classification task focuses on the semantic information of the object, while the localization task focuses more on the geometric information of the object. The misalignment between the classification and localization tasks can be divided into two aspects. On the one hand, shared features cause mismatches at the feature level. A shared feature must meet the needs of both tasks, or strike a compromise between them. Typically, the popular shared detection head in Faster R-CNN [3] brings an increase in speed and a reduction in parameters. However, it forces different tasks to focus on the same feature map. Double-Head R-CNN [24] proposes to split these two tasks into different heads: a fully connected head that focuses on classification and a convolution head that focuses on bounding box regression. TSD [25] decouples classification and localization in the spatial dimension by generating two disentangled proposals for them. However, it is not appropriate to directly generate two weakly correlated disentangled proposals from the same proposal, because the two disentangled proposals may end up representing different objects.
On the other hand, the classification confidence is not enough to describe the accuracy of localization. During the training process, the classification confidence of the positive samples is expected to be calculated as 1, regardless of whether the IoU between the bounding box and the corresponding ground truth box is high or low. Therefore, the classification confidence does not naturally reflect the localization accuracy. IoU-Net [7] designs an IoU prediction head parallel with the other branch to predict the regressed IoU score. Similarly, MS R-CNN [26] improves Mask R-CNN [27] by attaching a Mask-IoU head parallel with the Mask head to predict the IoU between the predicted mask and the corresponding ground truth mask.

Rethinking RepPoints
Usually, a four-dimensional vector {x, y, w, h} is applied to represent the bounding box of the object, where (x, y) are the coordinates of the center point of the bounding box, and (w, h) are its width and height respectively. The core idea of RepPoints is to represent the category and location of the object through a more refined point set. Therefore, the RepPoints method describes the object in the form of a point set $P = \{(x_i, y_i) \mid i = 1, \ldots, K\}$. These point sets not only describe the geometric information of the object, but can also be used to guide the feature sampling. Through the localization and classification losses in the training process, RepPoints can spontaneously find suitable key points for classification and localization. To realize this idea, RepPoints employs deformable convolution to sample features at the points of the point set. Compared with standard regular convolution, the sampling positions of deformable convolution can change according to the input. Here, the convolution operation can be regarded as a sampling process. Figure 1a shows the regular sampling grid of a standard convolution with a kernel size of 3 × 3. In the sampling process of this convolution, the positions of the nine sampling points follow a regular distribution. Figure 1b shows a deformable convolution sampling process in which the positions of the nine sampling points can change dynamically according to the input. In this paper, the sampling center of regular convolution is denoted as P, and the set of sampling positions is denoted as R. R defines the receptive field size and dilation rate of the convolution. For example, $R = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$ represents the sampling positions of a convolution with a kernel size of 3 × 3 and a dilation rate of 1. Then, the sampling set of the regular convolution with the center point at P can be described as $L_r = \{P + R_i \mid R_i \in R\}$.
The deformable convolution employs the offset O to further enhance the sampling position, and its sampling set can be described as $L_d = \{P + R_i + O_i \mid R_i \in R,\ O_i \in O\}$.
Appl. Sci. 2021, 11, 5876
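The two sampling sets can be written out concretely. The sketch below is only an illustration of the definitions of L_r and L_d: the function names, the (row, column) coordinate order, and the example offset values are assumptions, and in a real detector the offsets O would be predicted by the network rather than set by hand.

```python
import numpy as np

def regular_grid(kernel_size=3, dilation=1):
    """Relative sampling positions R of a regular convolution."""
    half = kernel_size // 2
    steps = np.arange(-half, half + 1) * dilation
    # Row-major order: (-1, -1), (-1, 0), ..., (1, 1) for a 3 x 3 kernel.
    return np.array([(dy, dx) for dy in steps for dx in steps], dtype=float)

def regular_sampling_set(center, R):
    """L_r = {P + R_i}."""
    return np.asarray(center, dtype=float) + R

def deformable_sampling_set(center, R, O):
    """L_d = {P + R_i + O_i}, with one offset O_i per grid position."""
    return np.asarray(center, dtype=float) + R + O

R = regular_grid()
center = (5.0, 5.0)
L_r = regular_sampling_set(center, R)

# Hypothetical offsets: move only the first sampling point.
O = np.zeros_like(R)
O[0] = (-0.5, 0.25)
L_d = deformable_sampling_set(center, R, O)
```

With all offsets equal to zero, L_d reduces to L_r, which is why deformable convolution can be viewed as a generalization of regular convolution.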
In ordinary deformable convolution, the offset feature map is learned freely by the network, without a direct supervision signal. In RepPoints, a regression loss is used to directly supervise the generation of the offset feature map, which makes the network learn the desired sampling locations explicitly. As shown in Figure 2, RepPoints can be roughly divided into two steps. In the first step, the object is initially located with the offset map computed from the output feature map of the backbone. The offset map is supervised by the regression loss during training. Therefore, the offset map represents the point set of the predicted object at each position. The second step classifies and further fine-tunes the results of the previous stage; it uses the localization result of the previous stage, the offset map, to guide the sampling of the deformable convolution. Most of the current common object detection datasets use regular rectangles to label objects. Therefore, the point set in RepPoints needs to be transformed by the pseudo box function $F_b$, as shown in Equation (1):
$$\mathrm{box}' = F_b(P) \tag{1}$$
where box' is a pseudo box and $F_b(\cdot)$ is a pseudo box conversion function, which has three implementation forms: min-max, partial min-max, and moment-based. In this way, the supervision signal can be applied directly to the pseudo box. At inference, the conversion function $F_b(\cdot)$ is also used to transform the point set P into the pseudo box to obtain the final detection result.
In RepPoints, the same set of points is employed to describe the feature sampling locations required for both classification and localization, which ignores the difference between the two tasks. The localization task performs bounding box regression and requires the geometric information of the object. The classification task is responsible for classifying the object and focuses more on the semantic information of the object than the localization task does. Using the same set of sampling points for both tasks therefore ignores their different needs.
In addition, the authors of RepPoints propose three conversion functions $F_b$: min-max, partial min-max, and moment-based. The min-max and partial min-max conversion functions use the minimum bounding rectangle of the point set (or of a subset of it) as the bounding box of the object. The essence of the moment-based conversion function is to use the average of all points as the center of the object, and the variance of all points as the width and height of the object. However, there are some problems in the sampling point set with the moment-based conversion function. Figure 3 visualizes the sampling points of RepPoints. The green rectangle is the bounding box of the detected object; the starting point of each yellow solid line is the center point, which is the intersection of the yellow solid lines in Figure 3, and the ending point of each yellow solid line is a sampling point of the localization branch $L_d$; the starting point of each green solid line is a sampling point of the localization branch $L_d$, and its ending point is the sampling point after the localization fine adjustment. When using the moment-based conversion function, Figure 3b shows that some sampling points are outside the box of the object, while others are inside the object. This is because the classification task needs to sample the semantic key points of the object, which are generally inside the object, but the localization task needs to calculate the bounding box of the object from the whole point set. Therefore, due to the differences between the two tasks and the characteristics of the moment-based conversion function, some sampling points have to be located outside the object to meet the needs of both tasks. However, sampling points outside the object can only collect information irrelevant to the object, which may be unfavorable for object detection. Therefore, the proposed LRP-DS network uses the min-max conversion function.
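The difference between the min-max and moment-based conversions can be seen in a small numerical sketch. This is an illustration under stated assumptions rather than the RepPoints implementation: the function names are made up for the example, the moment-based box uses the standard deviation multiplied by a hypothetical constant `scale` as the half-extent, and the partial min-max variant is omitted.

```python
import numpy as np

def min_max_box(points):
    """points: (K, 2) array of (x, y). Tightest axis-aligned box around all points."""
    x1, y1 = points.min(axis=0)
    x2, y2 = points.max(axis=0)
    return np.array([x1, y1, x2, y2])

def moment_box(points, scale=1.0):
    """Center = mean of the points; half-extent = scale * standard deviation.
    `scale` is a hypothetical constant used only for this sketch."""
    center = points.mean(axis=0)
    half = scale * points.std(axis=0)
    return np.concatenate([center - half, center + half])

# Hypothetical point set: four corner-like points and one interior point.
points = np.array([[0, 0], [4, 0], [0, 2], [4, 2], [2, 1]], dtype=float)
mm = min_max_box(points)
mo = moment_box(points)
```

In this example the min-max box contains every point by construction, whereas the moment-based box is narrower than the spread of the points, so the four outer points fall outside it. This mirrors the observation above that, with the moment-based function, some sampling points end up outside the box of the object.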

Build the Backbone Network Based on MobileNetV2 and FPN
RepPoints adopts ResNet and FPN as the backbone network by default. However, to keep the model lightweight, this paper uses MobileNetV2 and FPN as the backbone network. Compared with the ResNet50 and FPN backbone, the detection accuracy based on MobileNetV2 and FPN is slightly lower, but the running speed is more than doubled. The core idea of MobileNetV1 is to replace traditional regular convolution with depthwise separable convolution, that is, a depthwise convolution followed by a pointwise (1 × 1) convolution. On the basis of MobileNetV1, MobileNetV2 introduces the inverted residual structure.
An image with a resolution of 1000 × 600 is adopted as the input of the network, while the stride of the MobileNetV2 network after removing the fully connected layer is only 32. In this case, the receptive field of the network is relatively limited, which affects the perception of large objects. Therefore, on the basis of MobileNetV2, this paper adds structures No.8 and No.9 in Table 1 to give the network a greater depth and receptive field. The adjusted MobileNetV2 structure is shown in Table 1.

In Table 1, conv2d denotes a 3 × 3 regular convolution; bottleneck denotes the inverted residual structure of MobileNetV2; t is the expansion ratio of the number of channels in the inverted residual structure; c is the number of channels of the output feature; n is the number of times the structure is repeated; and s is the reduction factor relative to the input image.
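The saving that depthwise separable convolution brings over regular convolution can be checked with simple arithmetic. The sketch below counts weights only (biases and normalization parameters are ignored), and the channel numbers are chosen purely for illustration.

```python
def regular_conv_params(c_in, c_out, k=3):
    """Weights of a regular k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Illustrative channel numbers, not taken from Table 1.
c_in, c_out = 96, 96
regular = regular_conv_params(c_in, c_out)
separable = depthwise_separable_params(c_in, c_out)
ratio = regular / separable
```

For 96 input and 96 output channels, the regular convolution needs 82,944 weights and the separable version 10,080, roughly an eightfold reduction.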
The core idea of the feature pyramid network (FPN) is to fuse multi-scale information and make multi-scale predictions. It addresses the problem of object scale and provides a top-down information flow. In FPN, the outputs of the last residual blocks of the conv2, conv3, conv4 and conv5 stages of ResNet are defined as $\{C_2, C_3, C_4, C_5\}$, with strides of {4, 8, 16, 32} relative to the original image. The FPN then fuses the features from top to bottom to obtain the final feature maps $\{P_2, P_3, P_4, P_5\}$. Each feature map $P_i$ is formed by iteratively fusing the current-level feature map $C_i$ with the higher-level feature map $P_{i+1}$. In the feature fusion formula, $W_{3\times3}$ and $W_{1\times1}$ are convolutions with kernel sizes of 3 × 3 and 1 × 1, respectively; $W_{1\times1}$ converts the number of channels to 256; and $U_{2\times}$ is the 2× upsampling operation. Following the idea of FPN, the outputs of the last layers of structures No.7, No.8 and No.9 are taken as the $C_i$ to be fused and enhanced, and 96 is adopted as the number of channels of the $P_i$ features. This forms the MobileNetV2 and FPN backbone network used in this paper.
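A top-down fusion of this kind can be sketched in NumPy. The sketch assumes the fusion takes the form P_i = W3×3(W1×1(C_i) + U2×(P_{i+1})), with the top level computed from its own C_i only; this form is inferred from the symbol definitions above and is an assumption, not a reproduction of the paper's formula. The input channel numbers, spatial sizes, random weights, and nearest-neighbor upsampling are likewise hypothetical; only the 96 output channels come from the text.

```python
import numpy as np

def conv1x1(x, w):
    """x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    """x: (C, H, W), w: (C_out, C, 3, 3); zero padding keeps H and W unchanged."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             padded[:, dy:dy + h, dx:dx + wd])
    return out

def upsample2x(x):
    """Nearest-neighbor 2x upsampling (an assumed choice of U_2x)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features, lateral_w, smooth_w):
    """features: [C_low, ..., C_top]; each level has half the resolution of the previous one."""
    top = len(features) - 1
    outputs = [None] * len(features)
    outputs[top] = conv3x3(conv1x1(features[top], lateral_w[top]), smooth_w[top])
    for i in range(top - 1, -1, -1):
        merged = conv1x1(features[i], lateral_w[i]) + upsample2x(outputs[i + 1])
        outputs[i] = conv3x3(merged, smooth_w[i])
    return outputs

rng = np.random.default_rng(0)
in_channels = [32, 64, 128]   # hypothetical channel numbers for the three C_i
sizes = [16, 8, 4]            # hypothetical spatial sizes
out_channels = 96             # channel number of P_i stated in the text
features = [rng.standard_normal((c, s, s)) for c, s in zip(in_channels, sizes)]
lateral_w = [rng.standard_normal((out_channels, c)) * 0.01 for c in in_channels]
smooth_w = [rng.standard_normal((out_channels, out_channels, 3, 3)) * 0.01 for _ in in_channels]
pyramid = fpn_top_down(features, lateral_w, smooth_w)
```

Each output level keeps the spatial size of its input level and has 96 channels, so the detection head can be shared across levels.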

Decouple the Sampling Point Set between the Localization and the Classification Tasks
The localization and classification tasks of RepPoints share the same feature sampling locations, which ignores the fact that localization and classification attend to different features. The main idea of this paper is to decouple the localization and classification tasks, and to give the feature sampling points of classification a certain degree of freedom. The initial point set P obtained by the localization branch is the set of sampling positions. Because it is supervised by the ground truth, it mainly captures the geometric boundary characteristics of the object, such as its edges. However, the classification task focuses more on the semantic information of the object, so it is not appropriate for the two tasks to share one set of sampling points.
Therefore, two classification free sampling methods based on the idea of task decoupling are proposed, as shown in Figure 4. In the original RepPoints, classification and localization share the sampling set $L_d$. Furthermore, this paper designs two kinds of coordinate transformation functions $F(\cdot)$, shown in formula 3 and formula 4, where $R_i' = R_i + O_i = (x_i', y_i')$; $f_i = (f_{xi}, f_{yi})$; $f_{xi}$ and $f_{yi}$ are the free offsets along the x-axis and y-axis respectively, with a value range of [0, 1]; and γ is the relaxation factor of the coordinate transformation range. The set of sampling points on the localization branch is $L_d$. When the coordinate transformation function $F(\cdot)$ is an identity function, the sampling position set $L_c$ based on the degree of freedom degenerates to the original $L_d$.
Although LRP-DS-V1 decouples the sampling point sets of the classification and localization tasks, it may cause the classification task to lose track of the object that actually needs to be classified, that is, excessive decoupling. Take the left part of Figure 5 as an example, where the yellow rectangle is the current object. Ideally, the sampling positions for localization are distributed over the geometric extent of the object. Because of the decoupled degree of freedom, the sampling positions for classification are offset from them. This offset may move the classification sampling points onto the object in the red rectangle, which harms the classification of the current object.
Therefore, LRP-DS-V2 is proposed in this paper. Its basic idea is to keep the sampling set $L_d$ for sampling in the classification branch, to sample a second time with the free sampling method, and finally to merge the two sampling results before classification. This lets the network perceive the object that currently needs to be classified while still freeing the classification sampling points. As shown in Figure 4b, LRP-DS-V2 first uses the sampling set $L_d$ to sample the input feature of the classification branch and obtain feature A. A 1 × 1 convolution is then applied to feature A to obtain the free offset map of the sampling points. The free offset map is used to sample the input feature again, yielding feature B. Finally, feature A and feature B are fused, and classification is carried out. Feature A indicates the object to be classified, while feature B carries the semantic information needed for the classification. In this way, LRP-DS-V2 makes up for the defect of LRP-DS-V1.
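The difference between the two variants can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the feature map is a single-channel 2-D list, the free offsets are added directly to the localization points instead of going through formula (3) or (4), `predict_offsets` is a hypothetical stand-in for the 1 × 1 convolution, and fusion by concatenation is an assumption.

```python
import math

def bilinear(feat, x, y):
    """Bilinearly sample a 2-D feature map (list of rows) at fractional (x, y)."""
    h, w = len(feat), len(feat[0])
    # Assumption: clamp the location so a shifted point cannot leave the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay

def sample(feat, points):
    """Sample the feature map at every point of a point set."""
    return [bilinear(feat, x, y) for x, y in points]

def shift(points, offsets):
    """Move each point by its own free offset."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, offsets)]

def classify_features_v1(feat, loc_points, free_offsets):
    """V1-style: classification samples only at the shifted points."""
    return sample(feat, shift(loc_points, free_offsets))

def classify_features_v2(feat, loc_points, predict_offsets):
    """V2-style: feature A from L_d, feature B from shifted points, then fuse."""
    feature_a = sample(feat, loc_points)
    free_offsets = predict_offsets(feature_a)  # stand-in for the 1x1 convolution
    feature_b = sample(feat, shift(loc_points, free_offsets))
    return feature_a + feature_b  # assumed fusion: concatenation

feat = [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]
loc_points = [(0.0, 0.0), (1.0, 1.0)]
v1 = classify_features_v1(feat, loc_points, [(0.5, 0.0), (0.5, 0.0)])
v2 = classify_features_v2(feat, loc_points, lambda a: [(0.5, 0.0)] * len(a))
```

In the V1-style result the features at the original localization points are lost, whereas the V2-style result keeps them next to the freely sampled ones, which is the property the text relies on.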
Finally, because the point set serves as the feature sampling locations and the tasks are decoupled in this paper, the min-max pseudo box conversion function is more suitable for this method than the moment-based pseudo box conversion function.
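The two conversion functions can be contrasted with a short sketch. The min-max version follows directly from its name. The moment-based version is simplified here: the box centre is the mean of the points and the half-extent is the standard deviation times a multiplier, where the fixed `scale` argument is a hypothetical stand-in for the learnable multiplier used in RepPoints.

```python
from statistics import mean, pstdev

def minmax_pseudo_box(points):
    """Pseudo box (x1, y1, x2, y2) spanned by the outermost points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def moment_pseudo_box(points, scale=1.0):
    """Simplified moment-based pseudo box: mean +/- scale * std per axis."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx, cy = mean(xs), mean(ys)
    half_w, half_h = scale * pstdev(xs), scale * pstdev(ys)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

points = [(2, 1), (4, 3), (6, 5), (4, 9)]
box_minmax = minmax_pseudo_box(points)
box_moment = moment_pseudo_box(points)
```

With min-max, every point of the set lies inside or on the box, so the outermost points are tied directly to the box edges. With the moment-based form and `scale=1.0`, the box is tighter than the extent of the points, so individual points are less constrained to follow the object boundary.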

Reduce the Mismatch between Classification Confidence and Regression Localization
Non-maximum suppression (NMS) is an important part of current CNN-based object detection algorithms and removes duplicate bounding boxes. The core of NMS is to select the predicted bounding box with the highest classification confidence within a category and to eliminate the other boxes of that category whose IoU with it exceeds a given threshold. NMS then repeats this procedure iteratively on the remaining boxes. As a result, the classification confidence is implicitly given two responsibilities: describing the category probability and describing the localization accuracy. During training, however, the classification confidence of a positive sample is expected to be one, regardless of whether the IoU between the predicted bounding box and the corresponding ground-truth box is high or low. The correlation between classification confidence and localization accuracy is therefore low, which seriously affects the localization accuracy of the model.
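The greedy procedure described above can be sketched in plain Python. This is a generic single-category NMS, not code from the paper or from mmdetection; boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the threshold of 0.5 is only an illustrative value.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for the boxes of one category; returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Because the ranking uses `scores` alone, a box with high classification confidence but poor localization can suppress a better-localized neighbour, which is the mismatch discussed in this section.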
Although the proposed decoupling of the classification and localization tasks benefits the feature sampling of both, it makes the mismatch between classification confidence and localization accuracy more serious. Therefore, this paper introduces a localization score to describe the localization accuracy, so that the classification score only needs to describe the category probability. The currently popular IoU score is selected as the localization score. The IoU score is the output of an IoU branch that runs parallel to the localization branch, as shown by the location score map in Figure 4. Specifically, the IoU score is obtained by a 3 × 3 convolution, and the sigmoid activation function normalizes it to [0, 1]. In training, supervisory signals are applied to the IoU scores of positive samples only, and the more robust training methods of IoU-Net are not used. The multi-task loss of LRP-DS proposed in this paper is calculated as follows:
$$L = \lambda_1 L_{cls} + \lambda_2 L_{init\_loc} + \lambda_3 L_{refine\_loc} + \lambda_4 L_{iou}$$
where $L_{cls}$ is the classification loss, $L_{init\_loc}$ is the initial localization loss, $L_{refine\_loc}$ is the fine-tuned localization loss, and $L_{iou}$ is the IoU score loss. The classification confidence $p$ describes the category probability of the bounding box, and the IoU score describes the localization accuracy of the bounding box. The object score used in NMS can be recalculated by formula (6).
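The weighted loss and the rescoring step can be sketched as follows. The loss weights are the values reported in the experimental details. The combination of the classification confidence and the IoU score as a simple product is an assumption about the form of formula (6), and the function names and candidate values are hypothetical.

```python
def total_loss(l_cls, l_init_loc, l_refine_loc, l_iou,
               lambdas=(1.0, 0.5, 1.0, 0.5)):
    """Weighted sum of the four loss terms of the multi-task loss."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_cls + l2 * l_init_loc + l3 * l_refine_loc + l4 * l_iou

def detection_score(cls_conf, iou_score):
    """Score used to rank boxes in NMS (assumed form: product)."""
    return cls_conf * iou_score

# (name, classification confidence p, predicted IoU score); made-up values
candidates = [("A", 0.90, 0.50), ("B", 0.80, 0.85)]

by_confidence = sorted(candidates, key=lambda c: c[1], reverse=True)
ranked = sorted(candidates, key=lambda c: detection_score(c[1], c[2]),
                reverse=True)
```

Ranked by classification confidence alone, candidate A comes first. Ranked by the combined score, the better-localized candidate B moves ahead, so NMS would keep B and suppress A if the two overlap.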

Experimental Details
The experimental environment of this paper consists of an Intel Core i7 8700 CPU, an NVIDIA RTX2070 GPU, and Windows 10. The experimental comparison is conducted on the public dataset PASCAL VOC [28]. The object detection network is trained on the VOC07 trainval + VOC12 trainval dataset, which contains 16,551 annotated images and 20 predefined object categories. The network is evaluated on the VOC07 test dataset.
For a fair comparison, all experiments are implemented with PyTorch and the mmdetection [29] toolbox. To adapt to the more lightweight MobileNetV2 + FPN backbone, this paper reduces the number of stacked convolutions before sampling in the classification and localization branches from four to two, for both RepPoints and our LRP-DS. MobileNetV2 is initialized from an ImageNet [30] pre-trained model. In mmdetection, the default learning rate is 0.01 for a batch size of 16 (eight GPUs with two images each). Since only one GPU is available and it processes four images per batch, this paper divides the learning rate by four, following the linear scaling rule, to preserve the training behavior. Training follows the 1× schedule of mmdetection, and the optimizer is synchronous stochastic gradient descent (SGD). Input images are rescaled to (1000, 600). For data augmentation, only horizontal image flipping is used. For the weights of the multi-task loss, this paper sets $\lambda_1 = 1$, $\lambda_2 = 0.5$, $\lambda_3 = 1$, and $\lambda_4 = 0.5$. The relaxation factor $\gamma$ of the coordinate transformation range is simply set to 1. Unlike RepPoints, the more appropriate min-max conversion function is applied by default to convert point sets to pseudo boxes. Unless otherwise stated, the backbone of the object detection network is the MobileNetV2 and FPN structure, and all other hyperparameters follow the settings of mmdetection.
Table 2 reports the results of the ablation study on the PASCAL VOC07 test dataset. Table 2 shows that the proposed method of decoupling the sampling point set is effective: after applying the degree-of-freedom decoupling, the mean average precision (mAP) increases from 71.4% to 73.1%.
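The learning-rate adjustment described in the experimental setup is a one-line application of the linear scaling rule. The sketch below uses the values stated in the text; the function name is illustrative and not part of mmdetection.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: the learning rate scales with the total batch size."""
    return base_lr * batch_size / base_batch_size

# mmdetection default: lr 0.01 for 8 GPUs x 2 images = batch size 16.
# Setup in the text: 1 GPU x 4 images = batch size 4.
lr = scaled_learning_rate(base_lr=0.01, base_batch_size=16, batch_size=4)
```

The batch size shrinks by a factor of four, so the learning rate becomes 0.01 / 4 = 0.0025.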
Of the two coordinate transformation methods proposed in this paper, the coordinate transformation function $F_2$ is slightly better than $F_1$, with a maximum difference of 0.7% mAP. In addition, the experimental results in Table 2 show that LRP-DS-V2 is significantly better than LRP-DS-V1: the second decoupling method increases mAP from 72.6% to 73.3%. This verifies that the excessive decoupling of LRP-DS-V1 weakens the object perception ability of the classification task and thereby reduces accuracy, and that LRP-DS-V2 makes up for this shortcoming. To verify the necessity of describing localization accuracy independently, a comparative experiment on the IoU branch is conducted. Specifically, this paper designs an IoU branch that outputs localization scores in parallel with the refined localization branch. The IoU branch uses only a 3 × 3 convolution, so it hardly affects the computational cost of the network. For simplicity, only positive samples are selected for training, and more robust training methods are not used; such robust IoU training methods are complementary to the method in this paper. Table 2 shows that adding the IoU branch to RepPoints brings no benefit. For LRP-DS, however, the IoU branch further increases mAP to 73.3%. This demonstrates the necessity, in our method, of using the classification confidence and the localization score to describe the category probability and the localization accuracy of the object, respectively.

Ablation Study
Table 3 compares the moment-based and min-max pseudo box conversion functions. For RepPoints, the moment-based pseudo box conversion function gives a better mAP. For our LRP-DS, however, min-max gives a better mAP, with a maximum increase of 1.1% mAP. This is mainly because our method puts more emphasis on using the point set to guide sampling, so the point set needs to fit the object edges and the semantic key points of the object well. The moment-based conversion function has natural shortcomings, as shown in Figure 4, so the min-max pseudo box conversion function is more suitable for the LRP-DS object detection network proposed in this paper.
Tables 4 and 5 compare our LRP-DS with other methods. For a fair comparison, all object detection networks in Tables 4 and 5 directly use the implementation code provided by mmdetection, and all networks are retrained with the 1× training strategy in the same environment. The compared methods include single-stage and two-stage object detection methods, as well as anchor-free and anchor-based methods. Table 4 shows that the method proposed in this paper has the best detection performance among the methods using the same MobileNetV2 and FPN backbone. Table 5 reports the mAP, the GPU memory requirement, the multiply-accumulate operations (MACC), the number of parameters, the detection time, and the frames per second (FPS) of the different detectors. The MACC describes the computational complexity of the model.
Compared to the other detectors, our method has a higher mAP with similar computational complexity and speed. In addition, when a ResNet50 backbone is used, the detection time is 93.2 ms, which is nearly twice that of our LRP-DS. Due to GPU memory limitations, the batch size of Double Head is set to 2.
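The speed comparison can be checked with simple arithmetic. The sketch below converts between detection time and FPS, using the 93.2 ms reported here for the ResNet50 backbone and the 20 FPS stated in the abstract for LRP-DS; the helper names are illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-image detection time in ms."""
    return 1000.0 / latency_ms

def latency_ms_from_fps(fps):
    """Per-image detection time in ms for a given frame rate."""
    return 1000.0 / fps

resnet50_latency_ms = 93.2
lrp_ds_latency_ms = latency_ms_from_fps(20)            # 50 ms per image
resnet50_fps = fps_from_latency_ms(resnet50_latency_ms)
slowdown = resnet50_latency_ms / lrp_ds_latency_ms     # about 1.86
```

A ratio of about 1.86 between the two detection times matches the statement that the ResNet50 variant takes nearly twice as long per image.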

Conclusions
This paper proposes a lightweight RepPoints with a decoupled sampling point set (LRP-DS). LRP-DS employs MobileNetV2 and FPN as the backbone network in order to build a lightweight network with fast detection speed. Considering the differences between the classification and localization tasks, two classification-free sampling methods, LRP-DS-V1 and LRP-DS-V2, are proposed to decouple the sampling points of classification and localization. To split the responsibilities of the classification confidence, a localization score is introduced to describe the localization accuracy independently. The final architecture achieves 73.3% mAP on the PASCAL VOC07 test dataset, which is better than RepPoints, Libra RCNN, and other methods. The experimental results also verify the effectiveness of the proposed LRP-DS.