Dynamic Label Assignment for Object Detection by Combining Predicted IoUs and Anchor IoUs

Label assignment plays a significant role in modern object detection models. Detection models may yield totally different performances with different label assignment strategies. For anchor-based detection models, the IoU (Intersection over Union) threshold between the anchors and their corresponding ground truth bounding boxes is the key element since the positive samples and negative samples are divided by the IoU threshold. Early object detectors simply utilize the fixed threshold for all training samples, while recent detection algorithms focus on adaptive thresholds based on the distribution of the IoUs to the ground truth boxes. In this paper, we introduce a simple while effective approach to perform label assignment dynamically based on the training status with predictions. By introducing the predictions in label assignment, more high-quality samples with higher IoUs to the ground truth objects are selected as the positive samples, which could reduce the discrepancy between the classification scores and the IoU scores, and generate more high-quality boundary boxes. Our approach shows improvements in the performance of the detection models with the adaptive label assignment algorithm and lower bounding box losses for those positive samples, indicating more samples with higher-quality predicted boxes are selected as positives.


Introduction
Object detection is a fundamental problem in computer vision that simultaneously classifies and localizes all objects in images or videos [1][2][3].With the fast development of deep learning, object detection has achieved great success and been applied to many real-world tasks such as object tracking [4,5], image classification [6][7][8], segmentation [9,10], self-driving [11], and medical image analysis [12,13].Generally speaking, the detection models could be categorized as two-stage detectors and one-stage detectors.The two-stage detectors utilize Region Proposal Networks (RPN) [14] to select the anchors with high probability containing some objects and refine them.Those refined anchors are employed for classification and regression in the second stage.While one-stage detection models directly classify and regress the anchors to output the final results.The two-stage detection models often have higher accuracy while lower speed for inference compared to the one-stage detectors.RetinaNet [15] discovers that the class imbalance of positives and negatives is the reason for the accuracy gap between two-stage models and one-stage models and proposes focal loss [15] to solve this problem and close the accuracy gap between two types of models while still maintaining fast inference time of one-stage detectors.Recently, one-stage detection paradigms have a dominating influence for their high accuracy and low latency.Similar to image segmentation, object detection requires fine-grained details to accurately localize the bounding box of the object.Thus, recent object detectors frequently exploit Feature Pyramid Networks (FPN) [16] so that large feature maps with fine details could recognize the small objects, while small feature maps with large receptive field could detect the large objects.
Label assignment is to divide the samples into positives and negatives, which is essential to the success of object detection models.For anchor-based models, the core element for label assignment is the threshold for the division of positive samples and negative samples.After we calculate the Intersection over Union (IoU) between anchors and ground truth bounding boxes, the positive samples are those anchors whose IoUs are larger than the threshold, while others are negatives or ignored.Sometimes the detection models design two thresholds, one for positive and the other for negative.The anchors whose IoUs are larger than the positive threshold are positive samples, while the anchors whose IoUs are smaller than the negative threshold are negative samples.Those anchors whose IoUs are between the positive threshold and negative threshold are ignored during the training process.Those early detection models [14,17] utilize fixed threshold to divide the positives and negatives.However, the algorithms with the fixed threshold for dividing the positives and negatives ignore the differences between objects for their various shapes and sizes.For instance, Some large or square-shaped objects frequently have more high-quality anchors corresponding to them while some small and slender objects which have an extreme ratio of width and height often have low-quality anchors matched to them.Simply using a fixed IoU threshold for label assignment might deteriorate the performance of detecting the objects without regular shapes.Although we could design more anchors with various shapes and sizes to alleviate the problem, the overhead and computational cost are also increased, which is not desired for the requirements that prefer low latency and less computational cost.
Recently, more and more adaptive label assignment strategies (e.g., ATSS [18]) have been proposed to adaptively calculate the threshold.These algorithms adaptively select positive samples and negative samples based on the IoU distribution between anchors and ground truth bounding boxes so that the ground truth bounding boxes that have more high-quality anchors corresponding to them will have a higher IoU threshold and those which have the most lowquality anchors corresponding to them will have a low IoU threshold, as illustrated in Figure 1.Nevertheless, adaptive assignment methods do not assign positives and negatives based on the predictions which are more accurate to represent the training status.Due to the discrepancy between the classification and localization, classification scores cannot precisely correspond to the localization quality, while NMS (non-maximum suppression) supposes classification scores represent the localization quality and filters duplicates so that only the samples with high classification scores will be kept.However, if classification scores cannot accurately represent the localization quality, the high-quality bounding boxes might be eliminated and some low-quality bounding boxes might be kept.However, fixed anchors cannot guarantee the quality of the predicted bounding boxes.
Therefore, introducing predictions to instruct label assignment is an effective approach to include the anchors which could generate high-quality predictions as the positives.For some models using dynamic label assignment strategies, predictions are utilized to guide the label assignment during training since they represent the true training status.Nonetheless, the predictions in the early stage of training are inaccurate for both classification and bounding box regression; thus, they are inappropriate to instruct the label assignment if we directly apply them to dividing positive samples and negative samples.Adding distances to the ground truth centers as prior is proposed in the algorithm which utilizes predictions to weight the positive samples.Those prediction-based label assignment strategies exploit the distances or bounding boxes as the prior so that the positives are limited to the ground truth bounding boxes or those positives samples that are closer to the centers of the ground truth bounding boxes have more weights during the training process [19,20].The predictions (classification scores or predicted boxes) and the distances are two different "domains", so they could not be naturally combined.Thus, Autoassign [19] designed a center weighting module to solve this problem; however, the module may be sub-optimal due to the assumption that the samples closer to centers of the ground truth would have more weights.MAL [21] employs "All-to-Top-1" strategy which includes enough anchors to learn the detector in the early training stage, and the number of anchors gradually decrease with the training process going on and finally only one optimal anchor is utilized.However, "All-to-Top-1" [21] reduces the number of anchors in the bag based on iterations instead of predictions.Thus, the training may not be optimal since the number of anchors in the bag is not controlled by predictions and might not satisfy the training status.If the IoU between the anchor (red box) and the ground truth (green box) is larger than 0.5, the anchor is denoted as positive to the ground truth.Middle: ATSS calculates the adaptive IoU thresholds based on the distribution of the anchors for each ground truth.If the ground truth has the most high-quality anchors (IoU to the ground truth is high), the IoU threshold for the ground truth is also high.Otherwise, the IoU threshold is low.Right: Our method also considers the predicted box (blue box) for each anchor (red box).Even though some anchors do not satisfy the IoU threshold, they might meet the threshold considering the predicted boxes and the predicted boxes could assist the model to select the positive anchors more accurately.
In this paper, we propose a simple and effective method that directly combines the predicted IoUs between the predicted bounding boxes and the ground truth bounding boxes, and the anchor IoUs between the anchors and the ground truth bounding boxes.According to the adaptive models, the adaptive threshold is attained according to the statistical properties of the IoUs between the candidate anchors and the ground truth bounding boxes.Our method computes the distribution of the predicted IoUs and the anchor IoUs separately and then attains the combined parameters by simply adding them, respectively.Finally, the combined threshold is computed by the combined distribution parameters.Since the predictions are involved in the label assignment, soft targets (predicted IoUs between the predicted bounding boxes and the ground truth boxes) are more appropriate than the hard target (label 1) for positives in classification loss.QFL [22] and VFL [23] are commonly utilized classification loss with soft target.Both of them could further boost the performance of our proposed method.In addition, we replace the Centerness branch with the IoU branch for better accuracy.
In our work, we demonstrate the importance of using predictions to include samples with higher-quality predicted boxes as positives for label assignment and propose a simple and effective approach to introduce the predictions to the definition of positive and negatives.Our method could naturally introduce the predictions to label assignment algorithms and the anchors could perform as prior for predictions.The experiments on COCO dataset [24] illustrate the effectiveness of our method without extra cost.Designing object detection models dynamically based on training status are a trend recently.While directly using the predictions is unreasonable due to the inaccurate predictions in the early stage of training.Thus, how to create prior to restrict the predictions is important but under-explored.Our idea of exploiting anchor IoUs as prior to combine with predicted IoUs could be a good choice and might inspire more related works.

Object Detection
Object detection could be categorized as two-stage approaches and one-stage approaches [25][26][27][28].Two-stage detection models [14,16,29,30] first employ Region Pyramid Networks (RPN) [14] to select anchors with high possibility to have high IoUs with some ground truth objects and refine those candidate anchors.Then the refined anchors are fed into the second stage to be classified and further regressed.While one-stage detectors [15,17,[31][32][33] directly classify and regress the anchors without selecting and refining some candidates.Two-stage detectors often yield higher accuracy but lower speed compared to one-stage counterparts.With the emergence of RetinaNet [15], the accuracy discrepancy between the one-stage and two-stage models has been reduced by introducing focal loss to suppress the loss of easy samples so that one-stage methods could achieve both high accuracy and low latency.Thus, one-stage approaches are dominating current object detection models.
With the appearance of anchor-free models [34][35][36][37][38], the pre-defined anchors are no longer required for a well-performed detection model.The anchor-free models either regress the bounding box from anchor points (feature points) [37,38] or predicts some special points of the ground truth object such as corners or extreme points of the boundary boxes of the objects [34][35][36] and finally constructs the predicted bounding boxes from those special points.Recently, some object detection models [39][40][41][42][43] are enhanced by attention modules with Transformers [44], which are originally invented for natural language processing.DETR [40] first introduces the Transformers to the head of detection models, which is also anchor free.Nonetheless, due to the global attentions utilized in Transformers and the large resolution of the images for object detection, DETR requires much longer epochs than its CNN counterparts to convergence.Thus, recent algorithms [45,46] attempt to design fast training convergence DETR to speed up the training process.

Label Assignment
The label assignment is the core factor for the performance of the detection models and how to divide positive samples and negative samples would determine how the networks learn and converge.For RPN training in Faster R-CNN [14], the pre-defined anchors whose IoUs with any ground truth bounding box higher than 0.7 will be defined as positives, and those whose IoUs with all ground truth objects lower than 0.3 are negatives.The anchors with IoUs between 0.3 and 0.7 are ignored during training and the networks do not learn from those anchors.For SSD [17], the threshold of IoU between the positive anchor boxes and the ground truth objects is 0.5.RetinaNet [15] assigns an anchor box to a ground truth object if the IoU between them is higher than or equal to 0.5, and assigns an anchor box as a negative sample if the IoUs between the anchor and all ground truth boxes are lower than 0.4.Others are ignored during training.The aforementioned strategies are traditional label assignment approaches with fixed thresholds for dividing positives and negatives.Even though those detection models with fixed thresholds are still effective for label assignment, they ignore the differences between various object samples for their shapes, sizes and number of corresponding positive anchors.For instance, with fixed thresholds for label assignment, the objects with square-like shapes or large sizes might have more high-quality anchors corresponding to them so that they have more positive anchors during training, while some objects with slender shapes or small sizes might have the majority of low-quality corresponding anchor boxes; thus, they correspond to fewer positive anchors during training.So during training, the networks will be biased toward the objects with balanced ratios of width and height or large sizes and the performance of some slender or small objects will be compromised.
Recently, researchers focus on designing adaptive thresholds and gradually discard the fixed thresholds for label assignment.ATSS [18] computes the adaptive thresholds by calculating the mean and standard deviation according to the distribution of the IoUs between the candidate anchors and the ground truth objects.PAA [47] attains the anchor scores by combining the classification scores and localization scores, and then the selected anchor candidates which are based on top anchor scores are fitted into Gaussian Mixture Model (GMM) and the GMM is optimized by Expectation-Maximization algorithm probabilistically.The candidate anchor boxes are separated by the boundary schemes as positives and negatives.
Using predictions to guide the label assignment could be more accurate since the predefined anchors might not accurately reflect the actual training status.Nevertheless, predictions in the early training stage are inaccurate and unreasonable to instruct the label assignment.FreeAnchor [48] exploits maximum likelihood estimation (MLE) to model the training process so that each ground truth could have at least one corresponding anchor with both high classification score and localization score.To solve the inaccurate predictions in the early epoch of training, the mean-max function is proposed so that almost all candidate anchors could be used in training the network and the best anchor could be selected from candidates when the training is sufficient, which is similar to "All-to-Top-1" mechanism in MAL [21].MAL [21] employs predictions from classification and localization as the joint confidence for evaluation of the anchors.To alleviate the sub-optimal anchor-selection problem, MAL perturbs the features of selected anchors based on the joint confidence to suppress the confidence of those anchor candidates so that other anchors could have a chance to be selected and participate in the training process.In addition, MAL proposes "All-to-Top-1" strategy for anchor selection that includes enough anchors to be involved in training and gradually decreases the number of anchors in the bag to 1 with the training process going on.However, linearly decreasing the number of anchors in the bag is irrelevant to the prediction status and perturbing the features of selected anchors introduces randomness which is also not correlated to the current predictions.Thus, this strategy might not be optimal to train the network.
Autoassign [19] introduces Center Weighting as the prior to address the unreasonable predictions in the early training phase, which indicates that the samples closer to the centers of ground truth would have more weights.Nonetheless, the prediction confidence and the distance to the ground truth centers are two different "domains" and the combination might be sub-optimal even though Autoassign models the distances with Gaussian-shape weighting function [19].Our model naturally combines the predicted IoUs (IoUs between the predicted boxes and the ground truth boxes) and the anchor IoUs (IoUs between the anchor boxes and the ground truth boxes) as the combined IoUs.Since they are both IoUs to the ground truth boxes, they could be naturally combined together without any complex functions or weights.In the early training stage, the anchor IoUs dominate the training process due to the random initialization.As the training process proceeds, the predicted IoUs will gradually dominate the combined IoUs and more samples with high-quality bounding boxes will be selected as positives.

Proposed Approach
The adaptive label assignment strategies frequently divide the positive and negative samples by calculating the statistical parameters (e.g., mean and standard deviation) based on the candidate anchors or anchor bags which are selected according to the Euclidean distances between the centers of the anchors to the centers of the ground truth bounding boxes.After the candidate anchors are selected based on their positions to the ground truth boxes, the adaptive thresholds are computed based on the distribution of their IoUs to the corresponding ground truth bounding boxes.In the paper, ATSS [18] is utilized as an example to illustrate the adaptive label assignment method and our proposed approach.

Revisit ATSS
ATSS (Adaptive Training Sample Selection) [18] makes an empirical analysis of the anchorfree approaches and anchor-based approaches and concludes that how to divide the positive samples and negative is the significant difference between anchor-based models and anchor-free models.Thus, ATSS proposes an algorithm that calculates the adaptive thresholds for defining the positives and negatives based on the candidate anchors which are selected according to the Euclidean distances between the centers of the anchors to the centers of the ground truth bounding boxes.The ATSS algorithm is shortly summarized as below: For each GT (ground truth) box,

1:
Compute the Euclidean distances between the centers of all anchors and the center of the GT.

2:
Select k anchors with the smallest distances for each feature pyramid level; if the number of feature levels is L, the number of total candidate anchors corresponding to the GT is kL.

3:
Compute the IoUs between kL candidate anchors and the GT box.Then calculate the mean and standard deviation of those IoUs.

4:
The adaptive threshold is mean+std.If one anchor has IoU with the GT larger than or equal to mean+std, it is candidate positive of the GT, else it is negative.5: Only the candidate positives whose centers are inside the GT box would be the final positives of the GT, others are negatives.
ATSS computes the threshold adaptively according to the shapes and sizes of the GT bounding boxes.If the GT boxes are large or square-like, the threshold will be higher since there are more high-quality anchors corresponding to them.If the GT boxes have slender shapes or small sizes, the threshold will be lower due to most low-quality anchors corresponding to them.Nevertheless, ATSS only computes adaptive thresholds according to the relationship between anchors and GT boxes.It merely relies on the anchors and ignores the actually predicted bounding boxes.In other words, the anchor with the highest IoU to the GT box cannot guarantee its predicted bounding box also has the highest IoU to the GT among all positive anchors.Thus, some samples with high-quality predicted bounding boxes might be defined as negative samples whose classification target is 0. Thus, the performance of high-quality bounding boxes is affected.We will demonstrate it in the experiments.Using predicted information may improve the accuracy of defining the positives and negatives since predictions can represent the real training status of each sample.However, directly using the predictions might not be appropriate since the predictions in the early training stage are unreasonable to instruct the positive and negative definitions.Thus, we propose a simple and effective approach to address this problem by combining the predicted IoUs to the GT and the pre-defined anchor IoUs to the GT for each training sample.

Dynamic ATSS
We propose Dynamic ATSS, which introduces the predictions to the anchors for label assignment.In the early training phase, the predictions are inaccurate due to random initialization.Thus, the anchors would act as prior to instruct the label definition.The predictions gradually dominate the combined IoUs and would lead the label assignment with the training proceeding and the predictions being improved.The network structure is illustrated in Figure 2. The network structure is the same as ATSS, which has a CNN backbone [49], an FPN neck [16] and a shared head which has two branches for classification and regression, respectively, while our proposed approach would extract the regression results and decode the regression offsets to the coordinates of bounding boxes, and finally calculate the IoUs between the decoded bounding boxes and the GTs.The predicted IoUs will be combined with anchor IoUs for selecting the positive samples.
mean(CIoUs) = mean(PIoUs) + mean(AIoUs) std(CIoUs) = std(PIoUs) + std(AIoUs) Equations ( 1)-( 4) illustrate our proposed methods.PIoUs is the predicted IoUs between the predicted bounding boxes and the ground truth boxes, and AIoUs indicates the anchor IoUs between the pre-defined anchor boxes and the ground truth boxes.CIoUs indicates the combined IoUs which is the summation of predicted IoUs and anchor IoUs.When calculating the mean and standard deviation for CIoUs, we compute them for PIoUs and AIoUs separately and sum them together, as illustrated in Equations ( 2) and (3).Finally, the thresholds of CIoUs for defining the positives and negatives are computed by summation of the mean and standard deviation of CIoUs.Since PIoUs and AIoUs are both IoUs to the ground truth bounding boxes, they can be naturally combined together by summation without designing any sophisticated formula [19] or reducing the number of positives during the training process [21].We also implement some experiments that down-weight the AIoUs or up-weight the PIoUs with the training proceeding.However, the simple addition of PIoUs and CIoUs could yield the best results, which is demonstrated in the experiments.
Why is utilizing predictions so important to guide the label assignment?The predictions are more accurate than the pre-defined anchors for defining the positives and negatives since we select the final results and implement the NMS algorithm based on the predicted results instead of the anchor boxes.We frequently design the detection models based on the assumption that the samples whose pre-defined boxes have high IoUs with the ground truth boxes are appropriate to be selected as positives or the samples whose centers are close to the centers of the ground truth objects are good candidates for positives.Once the positive samples are selected for each image, they would not be modified during the training process since the pre-defined anchor boxes or anchor points are fixed and they would not be changed according to the training status.Nevertheless, the samples with high-quality predictions might not frequently be those samples with high-quality anchor boxes or anchor points, although they have higher probabilities to generate high-quality predictions.
If we force the samples with high-quality anchor boxes or anchor points to be the positives through the entire training process, the network would focus on learning those samples even though their predictions are not good enough and ignore those samples which could generate better-predicted results but may be assigned as negatives due to the relatively low-quality anchor boxes or anchor points.While if the predictions are introduced to assist the definition of positive samples and negative samples, we could select more samples with high-quality predictions as positives and further improve those samples.From Table 1, simply adding predicted IoUs to anchor IoUs could yield better results and generate higher-quality predictions.The anchor IoUs are also necessary for our approach due to the random initialization of the network and they can act as prior.In our method, the predictions and the prior are both IoUs to the ground truth bounding boxes; thus, they can be naturally combined together by addition without any special design, which is shown in Figure 2.  The network structure of our model.The structure of our model is the same as ATSS [18], which includes a CNN backbone [49], an FPN neck [16] and a head that has two branches for classification and regression, respectively.Our approach would employ the predicted boxes which are decoded from the regression branch.Then the predicted IoUs and the anchor IoUs are attained by calculating the IoUs between the predicted boxes and the GTs, and the IoUs between the anchor boxes and the GTs, respectively.Finally, the Combined IoUs (CIoUs) are computed by summing the predicted IoUs and the anchor IoUs.The same calculation is implemented to attain the combined mean and combined std.The IoU threshold is computed by the summation of combined mean and combined std, and the positive candidates are defined as the samples whose Combined IoUs are larger than or equal to the IoU threshold.The positive candidates are restricted inside the ground truth bounding boxes as the final positive samples.

Soft Targets for Classification Loss
With the emergence of focal loss [15], most modern object detection models exploit focal loss for learning the class labels.Focal loss addresses the extreme imbalances between positive samples and negative samples during training and suppresses the majority of easy negative samples which could dominate the training loss due to the extremely large number of those easy negatives.The focal loss function for positive samples (y = 1) and negative samples (y = 0) is shown in Equation (5).
Due to the introduction of predictions for label assignment, using soft target (predicted IoUs to the ground truth boxes) is more appropriate to rank the high predicted IoUs on top of other low predicted IoUs, which is utilized in GFL [22] and VFNet [23].GFL is comprised of QFL and DFL for classification and regression, respectively.We employ QFL for classification in our model.QFL [22] for classification is illustrated in Equation (6).
From the QFL equation, the cross-entropy loss is switched to the general form for positives (y > 0) since the soft targets are not equal to 1.In addition, the focal loss weights are also modified according to the soft targets.
Instead of down-weighting the losses when the classification predictions approach to the soft targets as used in QFL, VFNet [23] exploits VFL [23] that weights the positive losses with the soft targets to which those positives are assigned, which is demonstrated in Equation ( 7).By changing the weights to the IoU targets for positives (y > 0), the losses of positive samples with higher IoU targets would be also higher so that the network could focus on learning those high-quality positives.
In the experiments, we could empirically illustrate that our proposed approach surpasses the same model using QFL or VFL in Table 1.In addition, by combining our proposed method with QFL or VFL, the performance of the detection model could be further improved.

Experiments
Implementation Details: The experiments are conducted on COCO dataset [24], where train2017 is employed for training and val2017 is for validation and ablation study.The training strategy exploits scheduler 1x (90 k iterations) for validation where the batch size is 16, and the initial learning rate is 0.01 and the learning rate is decreased by a factor of 10 at 60 k and 80 k iterations.The images are resized so that the maximum length of the longer side is 1333 and the length of the shorter side is 800.ResNet-50 [49] is utilized for all ablation study.

Ablation Study
In the experiments, ATSS algorithm [18] is selected as an example to illustrate the effectiveness of our proposed approach.We introduce the predicted IoUs to the ATSS algorithm to demonstrate the effectiveness of our approach.

The Effectiveness of Proposed Method
To prove the effectiveness of our proposed method, we did several experiments based on the ATSS algorithm.The experimental results are shown in Table 1.
From Table 1, ATSS combined with our proposed CIoUs (Combined IoUs) surpasses the same model with soft targets (QFL and VFL) for classification loss.Our simple modification can boost the original ATSS algorithm by around 0.7 AP on COCO val2017 dataset, which demonstrates that using predictions could better guide the positive and negative definitions and the anchor boxes are also necessary for instructing the label assignment, especially in the early stage of the training process.By simply combining them together, the model could yield great accuracy improvement.We simply introduce CIoUs into ATSS here and the labeled target is still the hard target (1 for positives).In the following experiments, we will show that the performance could be further boosted with the soft target (QFL or VFL).

The Contribution of Each Element
In this section, we implement the experiments by removing some of the components.Through this ablation study, we can easily recognize the contribution of each element.The experimental results are demonstrated in Table 2.
In Table 2, similar to the aforementioned Equations ( 1)-( 4), AIoUs represents the IoUs between the pre-defined anchor boxes and the ground truth bounding boxes.If only AIoUs are selected, the original ATSS is implemented.PIoUs indicate the IoUs between the predicted bounding boxes and the ground truth boxes.If both AIoUs and PIoUs are selected, our proposed Combined IoUs are implemented by summing the computed AIoUs and PIoUs.We can obviously notice that only employing PIoUs for label assignment significantly compromises the performance of the model from 39.06 AP to 29.39 AP, while simply adding the PIoUs to AIoUs for defining the positive samples and negative samples could yield around 0.7 AP improvement and the improvement happens in all metrics (AP, AP50, AP75, APs, APm, APl), which verifies that using predictions could include more candidates with high-quality predicted bounding boxes as the positive samples and finally improve the overall performance of the detection models.Figure 3 illustrates the regression loss of the original ATSS and ATSS with our proposed combined IoUs.From Figure 3, the regression loss does not have too much difference in the early training phase for both models.While with the training process going on, our approach has lower regression loss than the original model, which indicates our model could select positive samples with higher-quality bounding boxes since more accurate predicted bounding boxes would yield lower regression loss.Additionally, the average precision for large objects (APl) is greatly improved by about 2%.From Table 2, our proposed approach (AIoUs+PIoUs) could be further improved by soft targets (QFL and VFL).The original ATSS implements Centerness as the additional branch to weight the positive samples so that the samples closer to the centers of the GTs could have relatively higher weights than those far from the centers of the GTs.After switching the Centerness to IoU (predicting IoU instead of centerness), the performance could be boosted.

Balancing Predicted IoUs and Anchor IoUs
We also investigate how to balance the predicted IoUs and the anchor IoUs when calculating the dynamic thresholds.The experiments are illustrated in Table 3. From Table 3, the model with the ratio between PIoUs and AIoUs being 1:1 could yield the best accuracy.D_up indicates the dynamic weight gradually increases the weight with the training proceeding.D_down represents the dynamic weight that gradually decreases the weight with the training proceeding.In the experiment, we utilize iteration/max_iter as the D_up and (1 − iteration/max_iter) as the D_down.In the representation, iteration is the current iteration and max_iter represents the total iterations the training would implement.The initial value of D_up is close to 0 and the final value of D_up would be close to 1 when the training is approaching to the end.While the initial value of D_down is close to 1 and the final value of D_down would be close to 0, which is the opposite of D_up.
Intuitively, the weight of the predicted IoUs should gradually increase since the predicted IoUs would be more and more accurate with the training proceeding.Thus, in our experiments, we apply D_up to the predicted IoUs and D_down to the anchor IoUs so that the predicted IoUs would gradually take over the label assignment and the anchor IoUs would gradually fade out of the label assignment.Nonetheless, introducing the dynamic weights to PIoUs and AIoUs does not improve the performance and applying D_down to AIoUs greatly worsen the accuracy of the detection model, which verifies that anchors still play a significant role in our method.Since PIoUs would gradually increase with the training proceeding due to increasing localization accuracy of predicted boxes, dynamic weight has no positive effect to PIoUs.In our experiments, simple summation of PIoUs and AIoUs is applied and no weights are utilized for combining them.This ablation study also demonstrates that anchor IoUs (prior) and predicted IoUs could be naturally combined together without any sophisticated formula or weights, which is simpler than using complicated formula to combine the center distance (prior) and the predictions [19] or "All-to-Top-1" [21] that gradually decreases the number of the anchors in the bag.

Application to the State-of-the-Art
The proposed approach combines the predicted IoUs with the anchor IoUs, so it could be directly applied to some state-of-the-art models that employ the ATSS algorithm for label assignment.GFLV2 [50] and VFNet [23] are both state-of-the-art detection models that employ ATSS algorithm for defining positives, negatives, and soft targets (IoUs) for the classification loss.Instead of directly predicting the offsets of the bounding boxes, GFLV2 [50] designs a distribution-guided quality predictor to estimate the localization quality via the bounding box distributions, and VFL [23] makes a further bounding box refinement by employing a star-shaped box feature representation.We applied our method to GFLV2 [50] and VFL [23] and the experimental results are shown in Table 4.
From Table 4, we can see that after applying our proposed method to the GFLV2 and VFNet, the two state-of-the-art models are further improved on COCO val2017.The improvement comes from the dynamic label assignment strategy (ATSS with our method) which selects samples with high quality predicted boxes as the positives.Thus, the network could focus on those positives and further refine them.Figures 4 and 5 illustrate and compare the bounding box losses of GFLV2 to GFLV2+CIoUs, and VFNet to VFNet+CIoUs, respectively.Similar to the application of our method to ATSS, the models with our method yield lower bounding box losses with the training proceeding, which demonstrates that our proposed method could help the detection models to select higher-quality predicted bounding boxes that have lower bounding box losses.

Comparison to the State-of-the-Art
We apply our dynamic label assignment strategy (CIoUs+QFL+IOU branch) to the COCO test-dev [24] and compare them with the original ATSS.Table 5 illustrates the experimental results of the state-of-the-art models and our methods, where "MStrain" indicates multiscale training strategy and "*" denotes our re-implementation.Scheduler 1x indicates 90 k iterations (12 epochs) and scheduler 2x represents 180 k iterations (24 epochs) on COCO test-dev benchmark.Table 5 illustrates the experimental results of some state-of-the-art models and our methods.The NVIDIA Tesla P100 GPU is employed to test the inference time.From Table 5, we can see that the proposed method does not introduce extra cost and the real processing time for each image per GPU is the same as the original model (e.g., 80 ms indicates processing one image per GPU needs 80 milliseconds).Our Dynamic ATSS model could boost the original ATSS by around 1% in overall performance and 2% for large objects (APl).We also notice that our results are similar to state-of-the-art GFL [22], which is also based on ATSS for label assignment.While due to the introduction of distribution focal loss (DFL), GFL [22] predicts the bounding boxes multiple times for each sample, which would slightly increase the number of parameters.Our model is much simpler without introducing extra computational cost and parameters, and directly predicts the 4 offsets of the bounding boxes, just as the original ATSS.

Conclusions
In this paper, we have illustrated the advantages of using predictions for defining positive samples and negative samples for object detection models and proposed a simple and effective method to improve the performance of adaptive label assignment algorithms.By combining the predicted IoUs and anchor IoUs, the label assignment approaches could divide the positive and negative samples dynamically according to the predictions.The predicted IoUs could be combined with the anchor IoUs (prior) by summation, so the dynamic ATSS could implement label assignment based on training status and select the samples with higher quality predicted bounding boxes as the positives.When appling the proposed approach to state-of-the-art detection models with the adaptive algorithm for label assignment, the performance of the models could be further improved.
The proposed method can select samples with high quality predicted bounding boxes as the positive samples even though they have relatively low quality anchors (with low IoUs between the anchors and the ground truth bounding boxes).While most detection models divide the positive samples and negative samples based on fixed anchor boxes, those anchor boxes are always positive samples during the training process only when their IoUs to the ground truth boxes are relatively high, even though the predicted boxes are not that excellent.Our method introduces the training status (the predicted box for each sample) into the division of positives and negatives, those candidates that easily generate high quality predicted boxes are easier to be categorized as the positives, which would assist the network to focus on those high quality samples and generate more high quality bounding boxes.In addition, the proposed method does not introduce extra computational costs and the inference time remains the same as the original methods, while our model increases the accuracy of the detection.
Since our method forces the network to pay much attention to the samples with high quality predicted boxes, the improvement for AP50 and small objects is less than that for AP75 and large objects.We are working to improve the algorithm to further boost its performance for AP50 and small objects.Prediction-based label assignment algorithms for object detection are significant to select high-quality positive samples according to training status.We hope our simple and effective approach could inspire more work on designing dynamic object detectors based on training status.

Figure 1 .
Figure 1.Illustration of label assignment strategies of RetinaNet, ATSS, and our method.The green box is the ground truth bounding box and the red and blue boxes are anchors and predicted boxes by corresponding anchors, respectively.Left: RetinaNet employs a fixed IoU threshold of 0.5 to select the positive anchors.If the IoU between the anchor (red box) and the ground truth (green box) is larger than 0.5, the anchor is denoted as positive to the ground truth.Middle: ATSS calculates the adaptive IoU thresholds based on the distribution of the anchors for each ground truth.If the ground truth has the most high-quality anchors (IoU to the ground truth is high), the IoU threshold for the ground truth is also high.Otherwise, the IoU threshold is low.Right: Our method also considers the predicted box (blue box) for each anchor (red box).Even though some anchors do not satisfy the IoU threshold, they might meet the threshold considering the predicted boxes and the predicted boxes could assist the model to select the positive anchors more accurately.

Figure 2 .
Figure 2.The network structure of our model.The structure of our model is the same as ATSS[18], which includes a CNN backbone[49], an FPN neck[16] and a head that has two branches for classification and regression, respectively.Our approach would employ the predicted boxes which are decoded from the regression branch.Then the predicted IoUs and the anchor IoUs are attained by calculating the IoUs between the predicted boxes and the GTs, and the IoUs between the anchor boxes and the GTs, respectively.Finally, the Combined IoUs (CIoUs) are computed by summing the predicted IoUs and the anchor IoUs.The same calculation is implemented to attain the combined mean and combined std.The IoU threshold is computed by the summation of combined mean and combined std, and the positive candidates are defined as the samples whose Combined IoUs are larger than or equal to the IoU threshold.The positive candidates are restricted inside the ground truth bounding boxes as the final positive samples.

Table 1 .
The effectiveness of the proposed method.

Table 2 .
The ablation study for each element.

Table 3 .
The ablation study on the weights for combination.

Table 4 .
The application of the proposed approach to state-of-the-art models.

Table 5 .
Evaluation results on COCO test-dev.