Article

Mixed Label Assignment Realizes End-to-End Object Detection

by Jiaquan Chen, Changbin Shao and Zhen Su *
School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4856; https://doi.org/10.3390/electronics13234856
Submission received: 19 November 2024 / Revised: 3 December 2024 / Accepted: 4 December 2024 / Published: 9 December 2024
(This article belongs to the Special Issue Applications of Computer Vision, 3rd Edition)

Abstract
Modern detectors have made significant progress in both inference speed and accuracy. However, they still require Non-Maximum Suppression (NMS) in the post-processing stage to eliminate redundant boxes, which limits further optimization of inference speed. We first analyze why detectors depend on NMS in the post-processing stage and find that the score loss under one-to-many label assignment produces high-quality redundant boxes that are difficult to remove. To realize end-to-end object detection and simplify the detection pipeline, we propose a mixed label assignment (MLA) training method: one-to-many label assignment provides rich supervision signals and alleviates performance degradation, while one-to-one label assignment eliminates the need for NMS in the post-processing stage. Additionally, a window feature propagation block (WFPB) is introduced, which exploits the inductive bias of images to enable feature sharing in local regions. With these methods, our end-to-end detector MA-YOLOX achieves 66.0 mAP on the DUO dataset and 52.6 mAP on the VOC dataset, outperforming YOLOX by 1.7 and 1.6 mAP, respectively. Moreover, without NMS, our model runs faster than other real-time detectors.

1. Introduction

Object detection is a fundamental task in computer vision, requiring the identification and localization of objects within images. Early object detection methods evolved from two-stage [1,2] to one-stage [3,4] designs. Two-stage methods use Region Proposal Networks (RPNs) to generate candidate regions that may contain objects, while one-stage methods directly produce dense prediction boxes, simplifying the detection process and speeding up inference. In recent years, the shift from anchor-based [5,6] to anchor-free [7] designs, together with lightweight and high-performance network structures, has further simplified models and achieved significant progress. However, the prediction boxes generated by detectors often contain a large number of redundant results, which must be filtered in the post-processing stage by a manually designed component, Non-Maximum Suppression (NMS). NMS removes highly overlapping redundant boxes by calculating the Intersection over Union (IoU) between prediction boxes for the same object, ensuring that only the best detection result for each object is retained. Because conventional detectors rely on NMS in the post-processing stage, the detection process is cumbersome and not truly end-to-end.
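For reference, the greedy NMS procedure described above can be sketched as follows; this is a minimal illustration, and the tensor layout and the IoU threshold value are assumptions rather than details taken from this paper.

```python
import torch
from torchvision.ops import box_iou  # pairwise IoU between [N, 4] boxes in (x1, y1, x2, y2)


def greedy_nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thr: float = 0.65) -> torch.Tensor:
    """Keep the highest-scoring box, drop boxes that overlap it above iou_thr, repeat."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thr]  # discard candidates overlapping the kept box
    return torch.tensor(keep, dtype=torch.long)
```

This per-object, IoU-based filtering is exactly the manually designed step that end-to-end detectors aim to remove.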
Recently, Transformer-based object detectors (DETR, DEtection TRansformer) [8] have been able to predict directly without NMS, removing various manually designed components, greatly simplifying the object detection pipeline, and achieving end-to-end object detection. DETR uses a bipartite graph matching algorithm to find one positive sample for each ground truth box, thereby achieving end-to-end object detection. However, DETR's high computational cost limits its practicality, whereas CNN-based detectors have achieved a reasonable trade-off between detection speed and accuracy, which could be further improved if NMS were not required. DeFCN [9] and OneNet [10] achieve end-to-end object detection with fully convolutional networks, demonstrating that one-to-one label assignment is crucial for end-to-end detection. However, training with one-to-one label assignment leads to a decline in detector performance. Ref. [11] introduces a PSS head that replaces NMS in the FCOS [12] detector by selecting a single positive sample for each instance, but this approach increases the complexity of the detection head. To address these issues, we propose a new end-to-end training method that maintains superior performance while leaving the structure of the detection head unchanged.
CNN-based object detectors generate multiple nearly identical predictions for each object, and the post-processing stage uses Non-Maximum Suppression to select the optimal prediction boxes as detection results. We examined the post-processing algorithm to understand the detector's reliance on NMS and found that the score loss under one-to-many label assignment is the key factor producing redundant boxes that cannot be easily eliminated. Previous works that remove NMS by employing one-to-one label assignment support this finding; however, one-to-one label assignment leads to a significant decline in detector performance. To address this issue, we propose an end-to-end training method called mixed label assignment (MLA). It uses a one-to-one score loss to prevent the generation of high-quality redundant boxes, eliminating the need for NMS and realizing end-to-end object detection, while retaining the one-to-many bounding box regression loss, which provides rich supervisory information to optimize the model and alleviate performance degradation. Additionally, DETR achieves strongly competitive results through attention mechanisms. However, because attention [13] on large-scale features incurs high computational cost and memory consumption, it is difficult to embed into every layer of the feature extraction stage. To exploit the inductive bias of convolutions and images for feature propagation in local areas, we propose a window feature propagation block (WFPB) that enhances feature sharing and is better suited to the feature extraction stage.
The main contributions of this paper are as follows:
  • Propose a novel end-to-end training method, mixed label assignment, which eliminates the need for NMS and simplifies the detection pipeline;
  • Introduce a window feature propagation block that is better suited for the feature extraction stage, enhancing local feature sharing;
  • Conduct extensive comparative and ablation experiments on the PASCAL VOC and DUO datasets, demonstrating the superiority and effectiveness of the proposed method.

2. Related Work

2.1. End-to-End Object Detection

Carion et al. [8] first proposed the Transformer-based object detection model DETR, which uses Hungarian matching to achieve one-to-one label assignment and thereby realizes end-to-end object detection. DETR eliminates manually designed components required by traditional detectors, such as the Non-Maximum Suppression algorithm, thus simplifying the object detection pipeline. However, DETR still suffers from two issues: a heavy computational burden and slow training convergence. Although Deformable DETR [14] reduces the computational cost of Transformers by using deformable attention, and DAB-DETR [15] accelerates training convergence by replacing queries with dynamic anchor box representations, DETR's training cost and inference latency remain significantly higher than those of CNN-based object detectors.
Convolutional networks can also achieve end-to-end detection. RelationNet [16] introduces a relation module to learn the relationships between objects and uses relation scores instead of IoU to filter redundant boxes. Learnable NMS [17] turns NMS into a learnable module, making the entire network trainable and achieving end-to-end training. Sparse R-CNN [18] replaces dense anchor boxes with a set of sparse proposal boxes, directly outputting the final detection results without NMS post-processing. DeFCN and OneNet change one-to-many assignment to one-to-one label assignment and propose dynamic matching costs, successfully eliminating the need for NMS in the post-processing stage. Zhou et al. [11] added a PSS head to FCOS that automatically selects a single positive sample for each instance, achieving end-to-end object detection without altering the original training method.

2.2. Label Assignment

Deep-learning-based object detectors are trained by computing the loss between predicted samples and ground truth labels. For each image, a detector generates a series of prediction boxes, and how to select appropriate samples to optimize the model has been a key research focus. In early works, researchers selected positive and negative samples based on the position of prediction boxes [19] or their IoU with ground truth boxes [20]; multiple positive samples were assigned to each target, providing rich supervisory signals and accelerating model convergence. This is known as one-to-many label assignment. However, relying solely on the position information of prediction boxes is often not optimal. ATSS [21] defines positive and negative samples with adaptively computed thresholds. OTA [22] treats label assignment as an optimal transport problem and dynamically estimates the number of positive samples, which YOLOX simplifies to SimOTA. TOOD [23] combines regression and classification information from predicted boxes into a cost matrix for selecting positive samples. DETR replaces one-to-many label assignment with one-to-one label assignment to eliminate redundancy.
Although one-to-one label assignment avoids redundant high-scoring boxes, it also brings drawbacks, such as less supervision per instance. Subsequent research addresses this issue [24,25,26,27]. For example, DN-DETR uses denoising training, adding random noise to ground truth boxes and learning to reconstruct them. MS-DETR employs parallel decoders for one-to-many supervision, and DeFCN uses an auxiliary loss. These methods provide richer supervisory signals through dual-label assignment without altering model inference, but the auxiliary branches introduce additional computational overhead during training.

3. Methods

Figure 1 presents MA-YOLOX’s structure and training methodology. While the object detection network employs one-to-many label assignment (Baseline) training to provide robust feature representations, this approach generates redundant boxes that typically require NMS filtering for final detection results. Through the analysis of the post-processing stage, we propose mixed label assignment (MLA), a novel end-to-end training method that eliminates NMS requirements. Additionally, we integrate a window feature propagation block (WFPB) to enhance local feature sharing and boost performance. These innovations enable the detector to achieve superior detection results without NMS post-processing.

3.1. Mixed Label Assignment

3.1.1. Task Assignment

For the series of prediction boxes generated by an object detection model, the post-processing stage first selects predictions whose confidence scores exceed a threshold and then applies NMS to filter the remaining predictions. Traditional detectors eliminate numerous low-quality prediction boxes in the first step. However, some high-quality redundant boxes for the same object remain and can only be removed through NMS. If the first step could remove all redundant boxes and keep only the desired predictions, NMS would be unnecessary.
Why are there redundant high-quality prediction boxes? We speculate that this is caused by training under one-to-many label assignment. To verify this hypothesis, we visualized the confidence heatmaps of the model under one-to-many and one-to-one matching training methods. The results are shown in Figure 2. It can be observed that under the one-to-many matching training method, multiple prediction locations for an object in the image have high confidence. With only confidence-based filtering, selecting the optimal prediction box becomes challenging. In contrast, with one-to-one matching, each object corresponds to a single prediction sample, and the confidence of the surrounding prediction boxes is suppressed. This approach eliminates the need for NMS to filter redundant boxes, achieving end-to-end detection. DeFCN and DETR demonstrated that one-to-one label assignment is crucial for end-to-end object detection. This paper explores which element of one-to-one label assignment plays a decisive role.
The post-processing stage filters predicted boxes solely based on the confidence score, regardless of the quality of the predicted boxes. We therefore hypothesize that the score loss in one-to-one label assignment plays the decisive role. To verify whether the choice of label assignment for bounding box regression affects the end-to-end result, we also visualize confidence heatmaps under mixed label assignment, as shown in Figure 2. Mixed label assignment maintains a single high-confidence prediction per target, similar to one-to-one assignment. Therefore, using one-to-one or one-to-many label assignment for bounding box regression does not affect the elimination of NMS post-processing. Previous works achieved end-to-end detection using purely one-to-one label assignment, but detector performance decreased significantly. Conventional detectors use one-to-many label assignment to provide sufficient foreground samples during training, which yields powerful feature representations and rich supervision signals but also produces redundant prediction boxes; without NMS, their performance drops dramatically. We therefore propose the mixed label assignment (MLA) training approach shown in Figure 3. By optimizing the model with one-to-many and one-to-one label assignments simultaneously, we can leverage the advantages of one-to-many assignment while avoiding redundant boxes, maintaining superior detection performance and achieving end-to-end object detection.
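The following is a minimal sketch of how the two assignments could be combined into a training loss, based on the description above and on Figure 3: one-to-one assignment supervises the confidence head so that only one location per object scores high, while one-to-many assignment supervises classification and box regression. The function names, tensor layout, and loss choices are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F


def mla_losses(obj_pred, cls_pred, iou_with_gt, o2o_idx, o2m_idx, o2m_cls_targets):
    """Sketch of mixed label assignment losses.

    obj_pred:        [N] confidence (objectness) logits for all candidate locations
    cls_pred:        [N, C] classification logits
    iou_with_gt:     [N] IoU of each prediction with its matched ground truth (0 if unmatched)
    o2o_idx:         [G] index of the single best sample per instance (one-to-one)
    o2m_idx:         flat indices of the top-K samples over all instances (one-to-many)
    o2m_cls_targets: class indices aligned with o2m_idx
    """
    # One-to-one assignment drives the confidence head: only the best sample per
    # instance is positive, so duplicate high scores are suppressed and NMS becomes unnecessary.
    obj_target = torch.zeros_like(obj_pred)
    obj_target[o2o_idx] = 1.0
    loss_obj = F.binary_cross_entropy_with_logits(obj_pred, obj_target)

    # One-to-many assignment keeps rich supervision for classification and regression,
    # as in the SimOTA-style baseline.
    loss_cls = F.cross_entropy(cls_pred[o2m_idx], o2m_cls_targets)
    loss_box = (1.0 - iou_with_gt[o2m_idx]).mean()  # IoU-style regression loss

    return loss_obj + loss_cls + loss_box
```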

3.1.2. Matching Cost

In the training stage, positive and negative samples must be selected to optimize the model, and appropriate samples are usually chosen by designing a matching cost. For instance, Faster R-CNN uses the Intersection over Union (IoU) between ground truth (GT) boxes and predicted boxes as the matching cost, whereas YOLOv5 compares the distances between GT and predicted box centers. These matching strategies may not be optimal because they consider only location information and ignore classification information; if suboptimal predictions are assigned as positive samples, model convergence becomes harder. OneNet points out that a matching cost lacking classification information is one of the factors hindering end-to-end detection. To select the best positive and negative samples, for sample i and target j, the matching cost is defined as follows:
C_{i,j} = C_{score}(i, j) + \lambda \cdot C_{loc}(i, j)        (1)
The final matching cost is a weighted sum of the score cost and the location cost. As shown in Equation (2), the score cost is the cross-entropy loss between the product of the category score and the confidence of predicted sample i and the target label c_j. Equation (3) gives the location cost C_{loc}(i, j), the IoU loss between the predicted box b̂ of sample i and the ground truth box b. By combining the two, we select prediction boxes with both high score and high IoU as positive samples, avoiding cases where boxes with high IoU but low score are taken as positive samples, which would lead to incorrect optimization objectives. The hyperparameter λ balances the roles of regression and classification.
C_{score}(i, j) = L(\hat{c}_i \cdot conf_i, c_j)        (2)
C_{loc}(i, j) = -\log(IoU(\hat{b}, b))        (3)
Mixed label assignment includes both one-to-one and one-to-many label assignments. One-to-one label assignment selects exactly one positive sample for each instance. As shown in Equation (4), the cost is computed between each instance j and each predicted sample σ, where N and G denote the number of predictions and the number of instances in the image, respectively. For each instance j, the cost is minimized to match the best positive sample σ̂. For one-to-many label assignment, using the same strategy as the Baseline [28], the top K samples are selected for each instance from the cost matrix, where K is determined by SimOTA.
\hat{\sigma}_j = \arg\min_{\sigma \in N} C_{\sigma, j}, \quad \forall j \in G        (4)
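A minimal sketch of Equations (1)-(4) is given below; the batching is omitted, the score cost is simplified to the positive-class term of the cross-entropy, and K is fixed here although the paper obtains it from SimOTA.

```python
import torch
from torchvision.ops import box_iou


def matching_cost(cls_prob, conf, pred_boxes, gt_boxes, gt_labels, lam=5.0):
    """Cost matrix C[sigma, j] between prediction sigma and ground truth j, Eqs. (1)-(3)."""
    ious = box_iou(pred_boxes, gt_boxes).clamp(min=1e-7)        # [N, G]
    loc_cost = -torch.log(ious)                                 # Eq. (3)
    joint = (cls_prob * conf.unsqueeze(1))[:, gt_labels]        # [N, G] score for each GT's class
    score_cost = -torch.log(joint.clamp(min=1e-7))              # positive-class cross-entropy, Eq. (2)
    return score_cost + lam * loc_cost                          # Eq. (1)


def assign(cost, k=4):
    """One-to-one: lowest-cost prediction per GT (Eq. (4)); one-to-many: K lowest-cost predictions per GT."""
    o2o_idx = cost.argmin(dim=0)                                # [G]
    k = min(k, cost.shape[0])
    o2m_idx = cost.topk(k, dim=0, largest=False).indices        # [K, G], one column per instance
    return o2o_idx, o2m_idx
```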

3.2. Window Feature Propagation Block

In addition to mixed label assignment, this paper designs an efficient, embeddable architecture to achieve more competitive end-to-end detection. DETR achieves strongly competitive performance with Transformers; however, the computation of self-attention grows quadratically with the input size, so using self-attention in the backbone feature extraction network is often limited by computational resources. This paper introduces a new module, the window feature propagation block (WFPB), as a replacement for attention.
For images, convolution operations possess locality and translation invariance, in contrast to Transformers. Owing to their inherent inductive biases [29], convolutional neural networks do not rely on global information; once they capture part of an object's features, they can infer the characteristics of adjacent areas from prior knowledge, which helps recognize or localize targets from local features. MAE [30] reconstructs heavily masked images from only a few visible patches and their positions, inferring global information from local cues. The Swin Transformer [31] limits self-attention computation to non-overlapping local windows to relieve the computational pressure on large-scale feature maps, achieving outstanding results and demonstrating the effectiveness of local features.
This paper uses feature propagation in local areas as a substitute for attention operations. The structure of the window feature propagation block (WFPB) is shown in Figure 4. First, for the feature map f, a convolution with a kernel size of 1 reconstructs the features along the channel dimension and reduces the computation of the subsequent feature propagation. Feature propagation then combines convolution and pooling: with a K × K convolution of stride 1, the sliding windows of adjacent regions partially overlap during sampling, so shared features are extracted and each window implicitly contains features from its surrounding windows. In parallel, a large-kernel pooling operation samples the important features of the local area. By summing the two branches, each window captures the important features within its local region, which realizes feature propagation. Finally, a 1 × 1 convolution restores the channels, and a residual connection [32] enhances stability. The K × K size of the convolution kernel serves as the window size.
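A minimal PyTorch sketch of a block matching this description is shown below (1 × 1 reduction, a K × K convolution and a large-kernel pooling branch whose outputs are summed, 1 × 1 channel restoration, and a residual connection). The channel reduction ratio and the pooling kernel size are assumptions, since the text does not specify them.

```python
import torch.nn as nn


class WFPB(nn.Module):
    """Window feature propagation block (sketch); k is the window size (3 in this paper)."""

    def __init__(self, channels: int, k: int = 3, reduction: int = 2, pool_k: int = 5):
        super().__init__()
        hidden = channels // reduction
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)          # reconstruct/reduce channels
        self.window_conv = nn.Conv2d(hidden, hidden, kernel_size=k,
                                     stride=1, padding=k // 2)            # overlapping K x K windows share features
        self.pool = nn.MaxPool2d(kernel_size=pool_k, stride=1,
                                 padding=pool_k // 2)                     # large-kernel pooling of salient local features
        self.restore = nn.Conv2d(hidden, channels, kernel_size=1)         # restore channel dimension

    def forward(self, x):
        f = self.reduce(x)
        f = self.window_conv(f) + self.pool(f)   # sum the convolved and pooled local features
        return x + self.restore(f)               # residual connection for stability
```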
The window feature propagation block is incorporated into the feature extraction phase of the network. The backbone of YOLOX is composed of the basic dark module [33]. The WFPB is applied after the convolutional downsampling of the dark module, performing feature propagation on the downsampled feature map, which yields the best results.

4. Experiments

4.1. Applied Datasets

The public PASCAL VOC [34] and DUO [35] datasets were chosen for model evaluation. The PASCAL Visual Object Classes (VOC) challenge, established in 2005, is a well-known computer vision benchmark that includes an object detection task. It features 20 detection categories: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and TV monitor. These categories cover objects commonly seen in daily life, including people, animals, household items, and vehicles, providing rich resources for the development of object detection. This paper uses the 2012 training and validation sets, consisting of 11,540 images, for training and reports results on the 4952 images of the 2007 test set.
In contrast to VOC, the DUO dataset targets underwater object detection. It was released for the Underwater Robot Professional Contest in 2021, which is aimed at robotic picking based on underwater images, and contains 6671 training images and 1111 testing images. The dataset includes four categories of underwater targets: holothurians, echinus (sea urchins), scallops, and starfish. These two datasets cover ordinary terrestrial scenes and underwater scenes, respectively; achieving excellent results on both better reflects the superiority of the method.

4.2. Experimental Settings

We implemented our model based on the YOLOX framework with Python 3.9, PyTorch 2.0.1, and CUDA Toolkit 12.4. The model was trained and tested on two NVIDIA RTX 3090 GPUs, and latency was measured on an RTX 4090 GPU with TensorRT. We used SGD as the optimizer with an initial learning rate of 0.01. All models were trained from scratch for 500 epochs with a batch size of 16, with other details following [28]. In particular, to verify the effectiveness of the end-to-end training method, we evaluated our model without using NMS (w/o NMS).

4.3. Evaluation Metrics

This paper uses mAP and latency to measure model accuracy and speed. mAP is the mean of the average precision over all classes, as described in Equation (7), and latency is the time taken by the detector from receiving an image to producing the detection result. The calculation formulas are as follows:
P = \frac{TP}{TP + FP}        (5)
R = \frac{TP}{TP + FN}        (6)
mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P(R)\, dR        (7)
Latency = Latency_f + T_{postprocess}        (8)
TP (True Positive) denotes correctly predicted positive samples, FP (False Positive) denotes incorrectly predicted positive samples, and FN (False Negative) denotes positive samples that were missed. Latency_f is the time required for the model's forward pass; adding the post-processing time gives the total detection latency. Together, accuracy and speed reflect the overall performance of a detector.
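For a single class, the average precision in Equation (7) can be approximated from score-ranked detections as sketched below; the raw precision-recall integration shown here (without the precision-envelope smoothing of the official VOC/COCO tools) is a simplifying assumption.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """AP for one class from score-ranked detections (Eqs. (5)-(7) for a single class)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    precision = tp / np.maximum(tp + fp, 1e-9)   # Eq. (5)
    recall = tp / max(num_gt, 1)                 # Eq. (6)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):          # integrate P over R, Eq. (7) for one class
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```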

4.4. Comparisons to State-of-the-Art

To demonstrate the effectiveness and efficiency of the proposed method, several representative detectors were adopted for comparison on the benchmark datasets, including the two-stage Faster R-CNN [20] and Cascade R-CNN [36]; the one-stage RepPoints [37], FCOS [12], ATSS [21], and GFL [38]; and the real-time YOLOv5 [19] and YOLOv7 [39]. All methods used their original settings and were trained from scratch at a resolution of 640 × 640.
Table 1 shows the experimental results of different methods on the DUO dataset. Using the COCO evaluation metrics [40], in addition to the conventional Average Precision (AP) and AP at an IoU threshold of 0.5 (AP_50), the evaluation also reports results for small, medium, and large objects (AP_S, AP_M, AP_L), providing a more comprehensive comparison of the models. Our method significantly outperforms the other benchmark networks. Compared to the two-stage and one-stage networks, MA-YOLOX achieves a better result of 66.0 mAP with fewer parameters, which is also attributable to YOLOX's efficient design. Compared to YOLOX itself, our detector improves detection performance by 1.7 mAP, significantly outperforming the other real-time detectors. Among the results for large, medium, and small objects, MA-YOLOX shows the largest gains on the more challenging medium and small targets, with increases of 3.3 AP_S and 1.9 AP_M, markedly improving the model's ability to detect small objects.
We also compared various real-time detectors, YOLOv5, YOLOv7, and YOLOv8 [41], on PASCAL VOC, as shown in Table 2. VOC is drawn from images of natural scenes and includes many common categories from daily life, so the results better reflect a detector's performance in real-world scenarios. The best result is shown in bold in the table; our detection results again outperform the other detectors, and NMS was not used during evaluation, demonstrating the effectiveness of mixed label assignment. Table 2 also reports the inference speed of each real-time detector, and Figure 5 visualizes the detection results in terms of speed and accuracy. Although MA-YOLOX does not have the lowest parameter count or computational complexity, it shows a clear advantage in inference speed, reaching 2.5 ms per image. This improvement comes from removing NMS in the post-processing stage and the additional computational overhead it introduces. Specifically, compared to YOLOX, MA-YOLOX reduces post-processing time by 0.5 ms by removing NMS, cutting the post-processing overhead roughly in half, and it is also faster than the YOLOv5, YOLOv7, and YOLOv8 real-time detectors. This result further supports our core idea: removing NMS is not only theoretically feasible but also yields a real speedup in practice.
To illustrate the importance of NMS in the post-processing stage, we visualize the detection results of YOLOX with and without NMS, together with ours, in Figure 6. Both YOLOX and our detector can effectively detect the objects in the images; however, without NMS, YOLOX produces multiple predicted boxes for the same object, which degrades detection quality. In dense scenes especially, the large number of redundant boxes can obscure the original content of the image, so NMS is particularly important for conventional detectors. MA-YOLOX achieves the same results as the NMS-dependent Baseline even without using NMS. These experiments demonstrate the effectiveness and feasibility of the method.

4.5. Ablation Study on VOC

Table 3 shows the ablation results based on YOLOX-S, and Figure 7 illustrates the detector's performance with and without NMS. The performance of the conventional detector declines dramatically without NMS. Both the one-to-one label assignment and the mixed label assignment allow the model to run without NMS in the post-processing stage, making inference 0.5 ms faster than the Baseline. However, one-to-one label assignment causes a significant drop in detection results, from 51.0 to 44.9 mAP. In contrast, mixed label assignment improves on one-to-one label assignment by 5.7 mAP, reaching 50.6 mAP. The window feature propagation block adds a further improvement of about 2.0 mAP on top of mixed label assignment, exceeding the Baseline, while increasing the parameters and computational cost by only around 10%. Our method achieves 52.6 mAP and a latency of 2.5 ms without NMS, improving detection results by 1.6 mAP and inference speed by 0.4 ms, thus outperforming the Baseline in both accuracy and speed. Moreover, the results without NMS are slightly higher than with NMS, indicating that NMS removes some accurate predicted boxes, which further supports the feasibility and effectiveness of this approach.

4.5.1. Analyses for Mixed Label Assignment

The mixed label assignment training strategy involves both one-to-one and one-to-many label assignment, where a key issue is how to allocate positive and negative samples; choosing the right samples is crucial for model training. As shown in Equation (1), the hyperparameter λ in the matching cost controls the relative importance of classification and regression. Different values of λ cause the model to optimize different positive samples during training, which affects performance. Table 4 presents the training results under different hyperparameters, with λ = 5 achieving the best detection result. Values of λ that are too high or too low degrade training, with λ = 1 giving the worst result. This indicates that the location of the predicted boxes should receive more weight during assignment; however, if λ is too high, the importance of classification is ignored, which also harms performance.

4.5.2. Analyses for Window Feature Propagation Block

The structure of the window feature propagation block is shown in Figure 4. The parameter K determines the window size for feature propagation, and we tested the model's performance for K = 1, 3, 5, and 7, as shown in Table 5. Performance for K = 3, 5, and 7 is significantly better than for K = 1, demonstrating the feasibility of the window feature propagation block. However, as the window size K increases, the parameters and computational cost also grow significantly while performance improves only slowly, indicating that redundant features are captured as the window expands. The best result is obtained at K = 7, but the parameter increase is too large, so the window size K is finally set to 3.
Additionally, to demonstrate the superiority of the window feature propagation block, we compared the detection results of WFPB with other attention modules, CBAM [42], CA [43], and PSA [44], under the same conditions, as shown in Table 6. CBAM (Convolutional Block Attention Module) and CA (Coordinate Attention) both implement attention through convolutions in the spatial and channel dimensions, whereas PSA (partial self-attention) is based on self-attention. WFPB achieves 53.7 mAP, clearly outperforming the other modules.

4.6. Generalization Experiments

Mixed label assignment is a new end-to-end training method that can be transferred to other advanced models. We conducted generalization experiments by applying mixed label assignment to the YOLOv8 model. The results in Table 7 show that the reliance on NMS in the post-processing stage of the traditional detector is successfully removed. The plain one-to-one (O2O) end-to-end training method suffers from significant performance degradation due to the lack of sufficient foreground samples, whereas mixed label assignment effectively alleviates this issue. These generalization experiments further highlight the effectiveness of the method.

5. Conclusions

This paper proposes a novel end-to-end training method called mixed label assignment, which avoids the performance degradation caused by one-to-one label assignment while preserving the advantages of one-to-many label assignment. The approach requires no additional branches or training overhead and significantly improves the performance of end-to-end object detection. Furthermore, the window feature propagation block effectively shares features in local regions by leveraging inductive bias and achieves remarkable results; our experiments demonstrate the importance of local region features in image-based detection tasks. The detector based on our method outperforms the Baseline in both detection results and inference speed. We hope that the designs introduced in this work will contribute to the development of better end-to-end training methods for object detection.

Author Contributions

Conceptualization, J.C. and C.S.; Methodology, J.C.; Software, J.C.; Validation, J.C.; Formal Analysis, J.C.; Investigation, J.C.; Resources, J.C.; Data Curation, J.C.; Writing—Original Draft Preparation, J.C.; Writing—Review and Editing, J.C., C.S. and Z.S.; Visualization, J.C.; Supervision, Z.S.; Project Administration, Z.S.; Funding Acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jiangsu Provincial Key Research and Development Program (No. BE2022136), the High-tech Ship Research Projects (No. CBG4N21-4-3), and the Key Research and Development Program of Zhenjiang City (No. GY2023019).

Data Availability Statement

The PASCAL VOC Datasets were from http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html (accessed on 25 May 2024) and the DUO dataset can be found at https://github.com/chongweiliu/DUO (accessed on 25 May 2024).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  2. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  3. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  5. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  6. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  7. Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  8. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  9. Wang, J.; Song, L.; Li, Z.; Sun, H.; Sun, J.; Zheng, N. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15849–15858. [Google Scholar]
  10. Sun, P.; Jiang, Y.; Xie, E.; Shao, W.; Yuan, Z.; Wang, C.; Luo, P. What makes for end-to-end object detection? In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 9934–9944. [Google Scholar]
  11. Zhou, Q.; Yu, C.; Shen, C.; Wang, Z.; Li, H. Object Detection Made Simpler by Eliminating Heuristic NMS. arXiv 2021, arXiv:2101.11782. [Google Scholar] [CrossRef]
  12. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355. [Google Scholar]
  13. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  14. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  15. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar]
  16. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597. [Google Scholar]
  17. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515. [Google Scholar]
  18. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  19. YOLOv5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 May 2024).
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  21. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  22. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 303–312. [Google Scholar]
  23. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar]
  24. Jia, D.; Yuan, Y.; He, H.; Wu, X.; Yu, H.; Lin, W.; Sun, L.; Zhang, C.; Hu, H. Detrs with hybrid matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19702–19712. [Google Scholar]
  25. Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group detr: Fast detr training with group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 6633–6642. [Google Scholar]
  26. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  27. Zhao, C.; Sun, Y.; Wang, W.; Chen, Q.; Ding, E.; Yang, Y.; Wang, J. MS-DETR: Efficient DETR Training with Mixed Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 17027–17036. [Google Scholar]
  28. Ge, Z. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  29. Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  34. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  35. Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A dataset and benchmark of underwater object detection for robot picking. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  36. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  37. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar]
  38. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  39. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  40. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. pp. 740–755. [Google Scholar]
  41. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 25 May 2024).
  42. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  43. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  44. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Figure 1. MA-YOLOX network architecture.
Figure 2. Visualization of confidence heatmaps predicted by various methods. The image is sourced from the VOC2007 test set and contains a 'man' and a 'horse'. The methods are, in order, one-to-many matching (Baseline), one-to-one matching (O2O), and the mixed label assignment (MLA) proposed in this paper. The heatmaps show the confidence scores of predictions at the P4 and P5 scales. It can be seen that the O2O and MLA methods significantly reduce redundant predictions of the same object compared to the Baseline.
Figure 3. The end-to-end training method of mixed label assignment (MLA). For input images, based on the model’s predictions of regression, classification, and confidence information, compute the matching cost and choose positive samples, then use the best positive sample ‘1’ to optimize the confidence prediction head by one-to-one label assignment; the positive samples ‘1, 2, 3, 4’ are optimized by performing one-to-many matching for the regression and classification.
Figure 4. (a) The basic composition of the YOLOX backbone’s dark module; (b) the window feature propagation block (WFPB).
Figure 5. Visualization of model detection results for the VOC dataset.
Figure 6. Visualization of detection results with or without NMS by YOLOX and MA-YOLOX.
Figure 7. Visualization of ablation experiment results.
Table 1. Comparison of different methods on the number of parameters and the accuracy on the DUO dataset; w/o NMS means without NMS during the validation (val) stage.
Method | Param. (M) | mAP | mAP_50 | mAP_75 | mAP_S | mAP_M | mAP_L
Faster R-CNN | 41.14 | 54.8 | 75.9 | 63.1 | 53.0 | 56.2 | 53.8
Cascade R-CNN | 68.94 | 55.6 | 75.5 | 63.8 | 44.9 | 57.4 | 54.4
RepPoints | 36.60 | 56.0 | 80.2 | 63.1 | 40.8 | 58.5 | 53.7
RetinaNet | 36.17 | 49.3 | 70.3 | 55.4 | 36.5 | 51.9 | 47.6
FCOS | 31.84 | 53.0 | 77.1 | 59.9 | 39.7 | 55.6 | 50.5
ATSS | 31.89 | 58.2 | 80.1 | 66.5 | 43.9 | 60.6 | 55.9
GFL | 32.04 | 58.6 | 79.3 | 66.7 | 46.5 | 61.6 | 55.6
YOLOV5 | 7.07 | 63.9 | 84.5 | 71.8 | 43.7 | 65.4 | 63.0
YOLOX | 8.94 | 64.3 | 83.7 | 72.2 | 49.7 | 66.3 | 62.6
YOLOV7 | 6.02 | 62.3 | 83.5 | 70.5 | 46.6 | 63.7 | 61.7
Ours (w/o NMS) | 10.07 | 66.0 | 85.1 | 73.1 | 53.0 | 68.2 | 63.9
The best results are in bold.
Table 2. Comparisons of different methods on the VOC dataset. Latency f denotes the Latency in the forward process of the model without post-processing.
Method | Param. (M) | GFLOPS | mAP_50 | Latency (ms) | Latency_f (ms)
YOLOV5 | 7.11 | 16.3 | 73.4 | 3.4 | 2.3
YOLOX | 8.95 | 26.8 | 74.6 | 2.9 | 1.8
YOLOV7 | 6.06 | 13.2 | 70.9 | 2.6 | 1.5
YOLOV8 | 11.14 | 28.7 | 75.6 | 2.8 | 1.7
Ours (w/o NMS) | 10.08 | 29.5 | 75.9 | 2.5 | 1.9
The best results are in bold.
Table 3. Ablation studies on VOC; the evaluation uses conf = 0.001 and an IoU threshold of 0.65.
Model | End-to-End | MLA | WFPB | Param. (M) | GFLOPS | mAP | mAP (w/o NMS) | Latency (ms)
YOLOX-S | | | | 8.95 | 26.80 | 51.0 | 18.6 | 2.9
 | ✓ | | | 8.95 | 26.80 | 44.9 | 44.9 | 2.4
 | ✓ | ✓ | | 8.95 | 26.80 | 50.6 | 50.7 | 2.4
 | ✓ | ✓ | ✓ | 10.08 | 29.57 | 52.6 | 52.7 | 2.5
The best results are in bold.
Table 4. The results of matching cost training under different hyperparameters λ .
λ | mAP | mAP_50
1 | 50.1 | 73.1
3 | 50.2 | 73.7
4 | 50.4 | 73.5
5 | 50.6 | 73.7
7 | 50.4 | 73.3
The best results are in bold.
Table 5. Results of the window feature propagation block with different K sizes on VOC.
Model | K Size | mAP
YOLOX-S | 1 × 1 | 52.56
 | 3 × 3 | 53.70
 | 5 × 5 | 54.06
 | 7 × 7 | 54.31
The best results are in bold.
Table 6. Comparison of different Attention methods.
Method | mAP | mAP_50
Baseline | 51.0 | 74.6
WFPB | 53.7 | 77.4
CBAM | 50.1 | 75.2
CA | 52.0 | 76.2
PSA | 52.7 | 76.9
The best results are in bold.
Table 7. Generalization experiments.
Model | Assignment | mAP (w/o NMS) | mAP_50 (w/o NMS)
YOLOV8 | Baseline | 22.8 | 28.7
 | O2O | 44.4 | 66.3
 | MLA | 50.8 | 71.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
