Article

DP2PNet: Diffusion-Based Point-to-Polygon Conversion for Single-Point Supervised Oriented Object Detection

1 Beijing Research Institute of Telemetry, Beijing 100076, China
2 School of Computer Science, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 329; https://doi.org/10.3390/s26010329
Submission received: 21 October 2025 / Revised: 14 December 2025 / Accepted: 1 January 2026 / Published: 4 January 2026

Abstract

Rotated Bounding Boxes (RBBs) for oriented object detection are labor-intensive and time-consuming to annotate. Single-point supervision offers a cost-effective alternative but suffers from insufficient size and orientation information, leading existing methods to rely heavily on complex priors and fixed refinement stages. In this paper, we propose DP2PNet (Diffusion-Point-to-Polygon Network), the first diffusion model-based framework for single-point supervised oriented object detection. DP2PNet features three key innovations: (1) A multi-scale consistent noise generator that replaces manual or external model priors with Gaussian noise, reducing dependency on domain-specific information; (2) A Noise Cross-Constraint module based on multi-instance learning, which selects optimal noise point bags by fusing receptive field matching and object coverage; (3) A Semantic Key Point Aggregator that aggregates noise points via graph convolution to form semantic key points, from which pseudo-RBBs are generated using convex hulls. DP2PNet supports dynamic adjustment of refinement stages without retraining, enabling flexible accuracy optimization. Extensive experiments on DOTA-v1.0 and DIOR-R datasets demonstrate that DP2PNet achieves 53.82% and 53.61% mAP50, respectively, comparable to methods relying on complex priors. It also exhibits strong noise robustness and cross-dataset generalization.

1. Introduction

Oriented object detection has received widespread attention in fields involving complex scenes, such as retail [1,2], scene text [3,4], and aerial images [5,6]. Compared with Horizontal Bounding Boxes (HBBs), Rotated Bounding Boxes (RBBs) can more accurately delineate objects that are densely arranged or appear in various orientations [7].
Manual annotation of fine-grained RBBs is both time-consuming and labor-intensive. In contrast, prevalent coarse-grained annotation formats that reduce costs include image-level, HBB-level, and point-level annotations. Among these, image-level annotations are the most cost-effective; however, they offer limited performance in complex aerial images due to the absence of prior location information about objects. The use of HBB-level annotations [8,9] is primarily motivated by the goal of training rotated object detectors on datasets already equipped with such annotations. However, HBB-level annotations are still inefficient and labor-intensive, and they introduce additional noise, compounding the problem. Concluding that neither image-level nor HBB-level annotations are adequate for oriented object detection, we turn to the principles of weakly supervised learning to explore an annotation approach that optimizes performance while minimizing annotation effort.
Point-level annotation is widely utilized across various computer vision tasks [10,11,12,13,14] and offers substantial cost savings: RBB and HBB annotation are reported to cost 104.8% and 36.5% more than point annotation, respectively [13]. It is particularly effective for annotating densely packed or small objects in aerial images captured from a bird's-eye view. In the realm of oriented object detection, only a limited number of researchers have explored point annotation [12,13,14], proposing a three-step pipeline to utilize it effectively: (1) Train a point label converter (PLC) using images annotated solely with points. (2) Use the trained PLC to convert point annotations into pseudo-RBB annotations. (3) Train any rotated object detector on these pseudo-labeled images in a supervised manner. The essence of this approach centers on the design of an accurate point label converter.
Point annotations lack two key pieces of prior information, spatial size and orientation, so existing PLCs primarily adopt a Multiple Instance Learning (MIL) approach. This strategy incorporates additional prior information to create proposal bags from point annotations, which are then optimized against category labels to select proposals with high classification confidence. Existing PLCs fall into two categories according to the type of proposal: bounding-box proposals or pixel proposals. The former uses manually designed hyperparameters such as size, aspect ratio, and angle to create diverse bounding-box proposals, whereas the latter relies on insights from large-scale models such as SAM [15] to generate pixel-mask proposals. However, both types of PLC depend heavily on the quality of the proposals derived from prior information when predicting pseudo-rotated bounding boxes, which diminishes the algorithm's generalization capability. Additionally, both types use multiple cascaded stages to progressively refine the size and angle of the pseudo-rotated bounding boxes. Nonetheless, as shown in Figure 1, the fixed number of refinement stages within the model constrains the algorithm's performance. We therefore identify the key issues that must be addressed when designing an efficient and accurate PLC: First, how can we minimize reliance on complex prior information and proposal quality while still accurately generating pseudo-rotated bounding boxes? Second, how can we flexibly adjust the number of refinement stages to produce more precise pseudo-rotated bounding boxes without altering or retraining the model?
In this paper, we introduce the DiffusionPoint2Polygon network, which models Point Label Converters (PLC) as a task that generates inspirational sampling points within the feature map space. These points highlight the semantically significant parts of objects. Inspired by the denoising diffusion model process that gradually removes noise to generate images, we initially designed a multi-scale consistent noise generation module to facilitate the diffusion process. This module introduces multiple Gaussian noises into point annotations and maps them onto feature maps with varying receptive fields. Consequently, this results in the acquisition of several noise sampling point bags with differing receptive fields as proposals. This method of proposal generation, based on noise, reduces dependency on complex prior information. During the denoising process, the objective of this paper is not to entirely remove noise to recover the original point annotations but rather to generate pseudo-RBBs. Consequently, this paper introduces a Noise Cross-Constraint module that processes distinct bags of consistent noise points for each receptive field separately, employing Multiple Instance Learning (MIL) classifiers that match the number of receptive fields. This facilitates the initial identification of objects of various sizes and shapes. After initially identifying the noise sampling points within the object area, this paper designs a Semantic Key Point Aggregator that utilizes the graph convolutional network to consolidate these points into inspiration sampling points, representing semantically significant parts of the object. Subsequently, a pseudo-RBB can be generated for each object using the convex hull formed from these inspiration sampling points. This Point2Polygon method not only avoids direct prediction of the object’s orientation but also eliminates the need for precise segmentation of the object’s contour. Thanks to the progressive refinement characteristic of diffusion models, this paper demonstrates the flexibility to adjust the number of refinement stages by modifying the number of sampling steps during the denoising process, thereby achieving more precise pseudo-RBBs.
To further clarify the technical breakthroughs of this study, we compare this method with existing single-point supervised methods such as PointOBB and P2RBox in three dimensions: “prior type”, “optimization stage flexibility”, and “model framework”, as shown in Table 1. It can be seen that this method realizes the innovation of replacing manual prior with noise prior and fixed-stage optimization with dynamic progressive optimization, which effectively solves the key problems of existing methods.
Our main contributions are as follows:
(1)
This paper models oriented object detection under single-point supervision as the generation process of inspirational sampling points in the feature space. To the best of our knowledge, this is the first study to apply diffusion models to this field.
(2)
The DiffusionPoint2Polygon Network presented in this paper not only breaks away from the dependency on complex prior information through the multi-scale consistent noise point generation module during the diffusion process, but also incorporates a Noise Cross-Constraint module during the denoising process to perceive noise sampling points within a range of objects of various sizes and shapes. Subsequently, using the Semantic Key Point Aggregator, it generates pseudo-RBBs based on convex hulls formed by inspiration sampling points that indicate semantically important parts of the object.
(3)
Our method exhibits competitive performance on the DOTA and DIOR-R datasets relative to existing approaches that depend on intricate prior information, utilizing solely noise points as the basis for prior knowledge.
The rest of this paper is organized as follows: Section 2 reviews related work, including single-point supervised oriented object detection and the application of diffusion models in perception tasks. Section 3 details the architecture of DP2PNet, the design of its core modules, and the underlying mathematical principles. Section 4 validates the effectiveness of the proposed method through dataset experiments and performance comparisons. Section 5 presents ablation studies, and Section 6 summarizes the research findings and outlines future research directions.

2. Related Work

2.1. Fully Supervised Oriented Object Detection

Oriented object detection has increasingly garnered interest for its application in complex scenes, such as aerial images, where oriented bounding boxes offer enhanced compactness over traditional horizontal ones. Detectors in this domain are principally categorized into anchor-based and anchor-free types. Meanwhile, three main supervision paradigms are used for training:
(1)
Image Supervision: Image-level supervision only annotates the categories of objects contained in an image, exemplified by the WSODet method.
(2)
HBB Supervision: As many existing datasets already have HBB annotations, another weakly supervised learning paradigm, represented by H2RBox and H2RBox-v2, utilizes HBBs as weak annotations to obtain RBBs. Their core is to use the self-supervised signal generated by actively rotating images and the inherent symmetry of objects to accomplish the transition from HBBs to RBBs.
(3)
Point Supervision: Point-level annotation provides stronger prior information about object location than image-level annotation while adding relatively little annotation cost, and it remains far cheaper than other instance-level annotations such as horizontal or rotated boxes. In the field of oriented object detection, point-level supervision is still a relatively new direction. Current research primarily focuses on designing a label converter that can transform point annotations into RBBs. PointOBB enhances the MIL network's ability to perceive the scale and orientation of objects by scaling and rotating the original image, while P2RBox introduces SAM as a mask generator and designs an inspector module to select high-quality masks for generating RBBs. Label converters can be effectively integrated with the now-mature RBox-supervised methods. However, existing label converters heavily rely on the quality of prior information and on multi-stage progressive refinement.

2.2. Diffusion Model for Perception Tasks

As a class of deep generative models, diffusion models [16,17] start from a sample drawn from a random noise distribution and recover the data sample via a gradual denoising process. Diffusion models have recently achieved significant success in image generation [18], and DiffusionDet [19] has begun to explore their potential in image discriminative tasks.
(1)
Application of Diffusion Models in General Object Detection: In recent years, diffusion models have been gradually applied to general object detection. DiffusionDet [19] models the detection process as a denoising diffusion over bounding-box coordinates, generating candidate boxes from random noise and gradually optimizing them. DiffusionRPN [20] applies the diffusion model to the region proposal network, generating high-quality candidate regions through noise denoising and improving the recall of small targets. Wang [21] designs a dynamic diffusion optimization module that adjusts the denoising steps according to the quality of the candidate boxes, realizing adaptive optimization of detection results. These methods all follow the framework of noise generation followed by step-by-step denoising, providing a new perspective on object detection.
(2)
Research Gaps of Diffusion Models in Oriented Detection: Although diffusion models have shown potential in general object detection, three problems remain unsolved in oriented detection scenarios: (1) Existing methods are not adapted to the rotation-angle prediction required by oriented detection; general object detection only predicts horizontal bounding boxes, whereas oriented detection must additionally predict a rotation angle, and current diffusion models lack a corresponding angle-optimization mechanism. (2) Existing methods are not optimized for the limited annotation information of single-point supervision; most diffusion-based detectors rely on full supervision (such as accurate bounding-box annotations) and cannot effectively exploit single-point annotations. (3) Existing methods do not solve the noise-adaptation problem for multi-scale targets; object sizes in oriented detection (especially in aerial images) vary widely, and a fixed noise distribution cannot adapt to targets of different scales.

2.3. Positioning of This Study

This study fills the above research gaps from three aspects: (1) By designing a Semantic Key Point Aggregator, the inspiration point set is obtained through graph convolution aggregation of noise points, and the convex hull of the inspiration point set is used to generate pseudo-RBBs, which realizes the angle prediction in oriented detection without directly predicting the angle. (2) Using noise prior instead of manual prior or external model prior, the model can effectively learn from single-point annotation information, reducing the dependence on annotation quantity. (3) By designing a multi-scale consistent noise generator, the noise variance is adaptively adjusted according to the scale of the feature map, realizing the noise adaptation of multi-scale targets.
Table 2 systematically compares DP2PNet with three representative single-point supervised oriented object detection methods across five core dimensions. It highlights DP2PNet’s innovations: replacing manual/external model priors with Gaussian noise, enabling dynamic refinement stages (1–7 stages) instead of fixed ones, and integrating diffusion models with MIL and graph convolution. Notably, DP2PNet avoids direct angle prediction by leveraging convex hulls of semantic key points, achieving competitive mAP50 (53.82%) on DOTA while enhancing generalization and reducing prior dependence.

3. Method

3.1. Preliminaries

(1)
Point label converter:
The fundamental concept of the point label converter is to transform the individual point annotations $P = \{P_u\}_{u=1}^{R}$ within an image $I$ into RBB annotations $B = \{B_u\}_{u=1}^{R}$. Each point annotation $P_u$ is represented as $(x, y, c)$, where $(x, y)$ specifies the position and $c$ denotes the category. Similarly, each RBB annotation $B_u$ is defined as $(c_x, c_y, w, h, a)$, where $(c_x, c_y)$ indicates the center position of the RBB, $(w, h)$ its width and height, and $a$ its angle. Consequently, the point label converter is designed to compensate for the absence of scale and orientation information in point annotations compared to RBBs.
(2)
Diffusion Model:
Recent diffusion models usually use two Markov chains: a forward chain that perturbs the image to noise and a reverse chain that refines noise back to the image. Formally, given a data distribution $x_0 \sim q(x_0)$, the forward noise-perturbing process at time $t$ is defined as $q(x_t \mid x_0)$. It gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \dots, \beta_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t \mid \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),$$ (1)

where $q(x_t \mid x_{t-1})$ is the conditional probability distribution of the noisy point set $x_t$ at time step $t$ given $x_{t-1}$ (following a Gaussian distribution); $\mathcal{N}(\cdot \mid \mu, \sigma^2 I)$ denotes a Gaussian distribution with mean $\mu$ and covariance matrix $\sigma^2 I$; $x_t$ and $x_{t-1}$ are the point sets at time steps $t$ and $t-1$ of the diffusion process; $\beta_t$ is the variance-schedule parameter controlling the noise intensity; and $I$ is the identity matrix.

Given $x_0$, we can easily obtain a sample of $x_t$ by sampling a Gaussian vector $\epsilon \sim \mathcal{N}(0, I)$ and applying the transformation

$$x_t = \sqrt{\hat{\alpha}_t}\, x_0 + \sqrt{1-\hat{\alpha}_t}\, \epsilon,$$ (2)

where $x_0$ is the original single-point annotation (clean data, $x_0 = P$ with position $(x, y)$ and category $c$); $\hat{\alpha}_t = \prod_{s=0}^{t}(1-\beta_s)$ simplifies the forward noise perturbation; $\epsilon$ is Gaussian noise sampled from $\mathcal{N}(0, I)$; other symbols are consistent with Equation (1).

In this work, we aim to model the point label converter through diffusion models. In our setup, the data sample is a single-point annotation $x_0 = P$, and the point label converter $f_\theta(x_t, t, I)$ acquires perception of object size and orientation in the process of predicting $x_0$ from the noisy points $x_t$ given the corresponding image $I$. The point label converter can then construct pseudo-RBBs for each single-point annotation based on the noise points obtained with the most appropriate number of sampling steps during the iterative process in the inference stage.
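To make the forward process concrete, the snippet below sketches how a padded single-point annotation could be perturbed to time step $t$ following Equation (2) with a cosine schedule for $\hat{\alpha}_t$. The function names (`cosine_alpha_bar`, `q_sample`) and the bag size are illustrative assumptions, not the released implementation.

```python
import math
import torch

def cosine_alpha_bar(t: int, T: int, s: float = 0.008) -> float:
    """Cumulative alpha_bar(t) under a monotonically decreasing cosine schedule."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(x0: torch.Tensor, t: int, T: int) -> torch.Tensor:
    """Forward perturbation of Eq. (2): x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = cosine_alpha_bar(t, T)
    eps = torch.randn_like(x0)                    # eps ~ N(0, I)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps

# Usage: perturb a padded point bag of N = 9 normalized coordinates at step t = 500.
x0 = torch.rand(9, 2)          # hypothetical padded single-point annotation
x_t = q_sample(x0, t=500, T=1000)
```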

3.2. Architecture and Pipeline

(1)
Architecture & Training:
The overall framework of the point label converter constructed in this paper is shown in Figure 2. During the denoising process, this paper introduces the Noise Cross-Constraint (NCC) module and the Semantic Key Point Aggregator (SKPA), both designed to reverse the diffusion process while enabling the model to discern object size and identify categorically valuable parts.
The NCC module focuses on perceiving objects of diverse sizes and shapes. It uses $M$ sets of noise point bags with different receptive fields for the $u$-th object ($\{\hat{B}_u\}_M^K$) to match corresponding MIL classifiers, generating classification scores $\{S_u\}_M^K$. Each MIL classifier includes two branches: a category-discriminating classification branch and a foreground-background-discriminating instance branch. Summing the scores of all points in each noise point bag yields the bag's total score $\{\hat{S}_u\}_M^K$, which is used to compute the multi-instance loss ($L_{mil}$). This design allows the module to detect objects across scales via the receptive fields of the noise points.
Furthermore, the NCC module integrates a Noise Consistency Loss ($L_{nc}$) based on self-supervised learning. This loss minimizes score disparities among noise point bags of the same object that share the same receptive field, enhancing the model's ability to consistently recognize objects of varying shapes.
To align the aggregated points with semantically significant parts of the object, the Semantic Key Point Aggregator (SKPA) aggregates noise sampling points from the selected noise point bag via graph convolution, producing semantic key points $A = \{A_u\}_{u=1}^{R}$. A Multilayer-Perceptron-based classification head calculates the point loss ($L_p$) for these aggregated key points.
Additionally, the Jarvis March algorithm is used to compute the convex hull of the semantic key points, and RROI Align extracts region-specific features to calculate the classification loss ($L_r$). This dual-loss design ensures that the semantic key points align with the object's critical classification regions while comprehensively covering the entire object.
(2)
Inference:
Similar to other diffusion-based methods, the inference process of the point label converter involves a denoising procedure—transforming Gaussian-distributed noisy points into annotation-aligned points through multiple sampling steps. However, distinct from standard diffusion models, our weakly supervised point label converter aims to construct RBBs during denoising rather than fully recovering the original point annotations.
In the denoising process, DP2PNet first leverages the Noise Cross-Constraint module to select the optimal feature map $F_i$ that matches the object size, based on the classification scores $\{S_u^K\}_F$ of the noise point bags. From the noise point bags corresponding to $F_i$, it further selects the bag $\{B_u^j\}_i$ with the best object coverage.
Subsequently, DP2PNet evaluates two key criteria for the current-stage inspiration sampling point bag $A_u^{curr}$: whether it identifies the most classification-valuable parts of the object, and whether it fully encompasses the object. If denoising is insufficient (i.e., the criteria are not met), the model uses DDIM [17] to estimate the next-stage inspiration sampling point bag $A_u^{next}$ from $A_u^{curr}$.
To align inference with the training process (which incorporates noise injection), the model supplements $A_u^{next}$ with new random Gaussian noise points. This flexible adjustment of the denoising steps avoids two critical issues: overly broad noise point coverage due to insufficient denoising and incomplete object coverage due to excessive denoising.
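The following sketch illustrates this dynamic refinement loop at inference. The methods `select_optimal_bag`, `aggregate`, `meets_criteria`, `ddim_step`, and `convex_hull_to_rbb` are hypothetical placeholders for the NCC bag selection, SKPA aggregation, the two stopping criteria, one DDIM [17] update, and the convex-hull fitting described above, not actual API of the released code.

```python
import torch

def infer_pseudo_rbb(model, image_feats, point_anno, max_stages=7, bag_size=9):
    """Iteratively denoise a noise point bag and stop once the key-point bag both
    marks the most classification-valuable parts and covers the object; a sketch."""
    bag = torch.randn(bag_size, 2)                            # start from Gaussian noise points
    key_points = None
    for stage in range(max_stages):
        bag = model.select_optimal_bag(image_feats, point_anno, bag)   # NCC: RFM + coverage
        key_points = model.aggregate(image_feats, bag)                 # SKPA key points
        if model.meets_criteria(key_points):                           # early stop
            break
        bag = model.ddim_step(bag, key_points, stage)         # estimate next-stage bag
        bag = torch.cat([bag, torch.randn(3, 2)], dim=0)      # re-inject fresh noise points
    return model.convex_hull_to_rbb(key_points)               # pseudo-RBB from the hull
```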

3.3. Multi-Scale Consistent Noise Generator

Because point annotations lack the object scale and orientation information contained in RBB annotations, existing point label converters mainly rely on manually designed priors composed of various combinations of sizes, aspect ratios, and rotation angles, or introduce prior models such as SAM to provide scale and orientation information. To minimize the dependence of the point label converter on such prior information, this paper completes the diffusion process within the diffusion model paradigm through the following three steps (a code sketch of these steps follows the list):
(1)
Sampling Point Padding: Due to the diversity of object scales, and to ensure that an adequate number of sampling points covers each object, we first add extra points to the point annotation $P_u$ of each object $u$. Consequently, each object $u$ is associated with a noise point bag $\hat{P}_u$ containing a fixed number of points $N$. We explored several padding strategies, such as duplicating the existing point annotation or padding with uniformly sampled random points; the latter yielded the best results.
(2)
Single-scale Consistent Noise Generation: We add Gaussian noise to the padded point set $\hat{P}_u$ of the $u$-th object, where the noise scale is controlled by $\alpha_t$, and $\alpha_t$ follows a monotonically decreasing cosine schedule across time steps $t$. Because individual noise sampling points have limited receptive fields, a point bag containing a fixed number $N$ of points may not effectively cover the object after adding noise generated from a single random seed, while directly adding more sampling points to $\hat{P}_u$ would be difficult to optimize due to insufficient supervision information. Therefore, at time step $t$, this paper uses $K$ different random seeds to generate $K$ different noise point bags $\{B_u\}^K$ for each padded point set $\hat{P}_u$.
(3)
Multi-scale Noise Sampling Point Mapping: Given the diversity of object scales, the size of the object parts most valuable for classification also varies. Thus, the noise sampling point bags $B$ are mapped onto the multi-scale feature maps $F$, enabling each sampling point to obtain deep features with varying receptive fields. The specific process is formalized as follows:

$$\hat{B}_m = \mathrm{Map}(B, \mathrm{Feature}_m), \quad m \in \{1, \dots, M\},$$ (3)

where $\hat{B}_m$ is the noise point bag mapped to the $m$-th multi-scale feature map; $\mathrm{Map}(\cdot)$ maps the noise point coordinates onto $\mathrm{Feature}_m$ for deep feature extraction; $B$ is the original noise point bag set; $\mathrm{Feature}_m$ is the $m$-th FPN output feature map ($M = 3$ for P2/P3/P4); and $M$ is the number of multi-scale feature layers.
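A minimal sketch of the three steps above is given below, assuming normalized point coordinates and FPN feature maps as inputs. `grid_sample` stands in for $\mathrm{Map}(\cdot)$ in Equation (3), and all names and default values (N = 9, K = 3) are illustrative rather than the released configuration.

```python
import torch
import torch.nn.functional as F

def build_noise_point_bags(p_anno, fpn_feats, N=9, K=3, a_bar=0.5):
    """Multi-scale consistent noise generation, sketched.

    p_anno:    (2,) single-point annotation, coordinates normalized to [0, 1].
    fpn_feats: list of FPN maps (e.g. P2/P3/P4), each of shape (1, D, H, W).
    a_bar:     cumulative noise level at the sampled time step t.
    Returns the K noise point bags and their per-level deep features.
    """
    # (1) Sampling point padding: grow the single point to a bag of N points.
    padded = torch.cat([p_anno[None, :], torch.rand(N - 1, 2)], dim=0)       # (N, 2)

    # (2) Single-scale consistent noise: K bags from K different random seeds.
    bags = torch.stack([a_bar ** 0.5 * padded + (1 - a_bar) ** 0.5 * torch.randn(N, 2)
                        for _ in range(K)])                                  # (K, N, 2)

    # (3) Multi-scale mapping: sample deep features at each point on every level.
    mapped = []
    for feat in fpn_feats:
        grid = bags.clamp(0, 1) * 2 - 1                     # grid_sample expects [-1, 1]
        pts = F.grid_sample(feat.expand(K, -1, -1, -1),     # (K, D, H, W)
                            grid[:, None, :, :],            # (K, 1, N, 2)
                            align_corners=False)            # -> (K, D, 1, N)
        mapped.append(pts.squeeze(2).permute(0, 2, 1))      # (K, N, D)
    return bags, mapped
```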

3.4. Noise Cross-Constraint Module

To learn the perception ability of objects at different scales, this module first refers to WSDDN [22] and designs three corresponding dual-stream structures, one per receptive field of noise point bags, as the MIL classifiers. Specifically, for the $K$ noise point bags $\{\hat{B}_u\}_i^K \in \mathbb{R}^{K \times N \times D}$ of the $u$-th object on the $i$-th feature map with $D$ channels, this paper feeds them into the classification branch $f_{cls}^i$ to obtain $\{O_u^{cls}\}_i^K \in \mathbb{R}^{K \times N \times C}$ and then applies an activation function to obtain the classification scores $\{S_u^{cls}\}_i^K \in \mathbb{R}^{K \times N \times C}$, where $N$ is the number of points in a noise point bag and $C$ is the number of categories. Similarly, the instance scores $\{S_u^{ins}\}_i^K$ are obtained through the instance branch $f_{ins}^i$ and an activation function:

$$\{O_u^{cls}\}_i^K = f_{cls}^i(\{\hat{B}_u\}_i^K), \qquad \{S_u^{cls}\}_i^K = \sigma_1(\{O_u^{cls}\}_i^K),$$ (4)

$$\{O_u^{ins}\}_i^K = f_{ins}^i(\{\hat{B}_u\}_i^K), \qquad \{S_u^{ins}\}_i^K = \sigma_2(\{O_u^{ins}\}_i^K),$$ (5)

where $\{O_u^{cls}\}_i^K$ is the raw output of the classification branch for the $K$ noise point bags of the $u$-th object on the $i$-th feature map; $f_{cls}^i(\cdot)$ is the classification branch network; $\{\hat{B}_u\}_i^K \in \mathbb{R}^{K \times N \times D}$ are the mapped noise point bags; $\{S_u^{cls}\}_i^K \in \mathbb{R}^{K \times N \times C}$ are the classification scores; $\sigma_1$ is the SoftMax activation function; $C$ is the number of object categories; $\{O_u^{ins}\}_i^K$ is the raw output of the instance branch; $f_{ins}^i(\cdot)$ is the instance branch network; $\{S_u^{ins}\}_i^K \in \mathbb{R}^{K \times N \times 1}$ are the foreground-background discrimination scores; $\sigma_2$ is the Sigmoid activation function; other symbols are consistent with Equation (4).

By calculating the Hadamard product of $\{S_u^{cls}\}_i^K$ and $\{S_u^{ins}\}_i^K$, the noise point scores $\{S_u\}_i^K \in \mathbb{R}^{K \times N \times C}$ are obtained. The noise point bag score $\{\hat{S}_u\}_i^K \in \mathbb{R}^{K \times C}$ is then computed by summing the scores of the $N$ noise points in each bag:

$$\{S_u\}_i^K = \{S_u^{cls}\}_i^K \odot \{S_u^{ins}\}_i^K \in \mathbb{R}^{K \times N \times C},$$ (6)

$$\{\hat{S}_u\}_i^K = \sum_{n=1}^{N} \{S_u\}_i^K \in \mathbb{R}^{K \times C},$$ (7)

where $\{S_u\}_i^K$ is the combined score of each noise point; $\odot$ denotes the Hadamard product (element-wise multiplication); $\{\hat{S}_u\}_i^K$ is the total score of each noise point bag (the sum of its $N$ point scores); $\sum_{n=1}^{N}$ sums over all points in a bag; other symbols are consistent with Equations (4)-(6).
The noise point bag score $\{\hat{S}_u\}_i^K$ can be seen as a weighted sum of classification scores and instance scores. The MIL loss for the noise point bags with the $i$-th receptive field uses the form of cross-entropy loss, defined as:

$$L_{mil}^i = \frac{1}{R \times K} \sum_{u=1}^{R} \sum_{j=1}^{K} FL(\{\hat{S}_u\}_i^j, c_u),$$ (8)

where $L_{mil}^i$ is the MIL loss for the $i$-th feature map; $R$ is the number of objects in the image; $FL(\cdot,\cdot)$ is the focal loss function; $\{\hat{S}_u\}_i^j$ is the total score of the $j$-th bag of the $u$-th object; and $c_u$ is the ground-truth category of the $u$-th object.
The MIL loss of the Noise Cross-Constraint module is the sum of the MIL losses over all receptive fields:

$$L_{mil} = \sum_{i=1}^{M} L_{mil}^i,$$ (9)

where $L_{mil}$ is the total MIL loss (summed over the $M$ feature maps); $L_{mil}^i$ is the per-feature-map MIL loss; and $M$ is the number of multi-scale feature layers.
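The dual-stream scoring and bag-level loss of Equations (4)-(9) can be sketched as follows. For brevity, the focal loss $FL$ is replaced here by a plain binary cross-entropy on the clamped bag scores, and the layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILHead(nn.Module):
    """Dual-stream MIL classifier for one receptive field (a sketch of Eqs. 4-7)."""
    def __init__(self, D: int, C: int):
        super().__init__()
        self.f_cls = nn.Linear(D, C)      # category-discriminating branch
        self.f_ins = nn.Linear(D, C)      # foreground-background branch

    def forward(self, bags: torch.Tensor) -> torch.Tensor:    # bags: (K, N, D)
        s_cls = F.softmax(self.f_cls(bags), dim=-1)            # sigma_1
        s_ins = torch.sigmoid(self.f_ins(bags))                # sigma_2
        s = s_cls * s_ins                                      # Hadamard product, Eq. (6)
        return s.sum(dim=1)                                    # bag scores, Eq. (7): (K, C)

def mil_loss(bag_scores: torch.Tensor, gt_class: int) -> torch.Tensor:
    """Bag loss for one object on one feature map (Eq. 8); the paper uses focal loss,
    a plain BCE on clamped scores is used here to keep the sketch short."""
    target = F.one_hot(torch.full((bag_scores.size(0),), gt_class, dtype=torch.long),
                       num_classes=bag_scores.size(1)).float()
    return F.binary_cross_entropy(bag_scores.clamp(1e-6, 1 - 1e-6), target)

# The total L_mil of Eq. (9) is the sum of mil_loss over the M per-level MILHeads.
```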
In an ideal scenario, the classification scores of noise point bags sharing the same receptive field should be consistent. Therefore, we use cosine similarity to measure the pairwise similarity between the scores of the noise point bags in the $i$-th receptive field:

$$sim_{j,t} = 1 - \frac{\{\hat{S}_u\}_i^j \cdot \{\hat{S}_u\}_i^t}{\|\{\hat{S}_u\}_i^j\| \cdot \|\{\hat{S}_u\}_i^t\|},$$ (10)

where $sim_{j,t}$ is the cosine-based dissimilarity between the $j$-th and $t$-th bag scores (range $[0, 2]$); $\{\hat{S}_u\}_i^j \cdot \{\hat{S}_u\}_i^t$ is their dot product; and $\|\{\hat{S}_u\}_i^j\|$ and $\|\{\hat{S}_u\}_i^t\|$ are their L2 norms.
After the similarity measurement, we obtain the consistency loss of the $i$-th receptive field as well as the total consistency loss of the Noise Cross-Constraint module:

$$L_{nc}^i = \sum_{j=1}^{K} \sum_{t=1}^{K} l_s(sim_{j,t}, 0), \qquad L_{nc} = \sum_{i=1}^{M} L_{nc}^i,$$ (11)

where $L_{nc}^i$ is the consistency loss for the $i$-th feature map; $l_s(\cdot, 0)$ is the Smooth L1 loss with target 0; $L_{nc}$ is the total consistency loss; other symbols are consistent with Equation (10).
Finally, the loss of the Noise Cross-Constraint module, $L_{CC}$, is the weighted sum of the MIL loss and the consistency loss:

$$L_{CC} = \alpha_1 L_{mil} + \alpha_2 L_{nc},$$ (12)

where $L_{CC}$ is the total loss of the Noise Cross-Constraint module; $\alpha_1 = 0.6$ and $\alpha_2 = 0.4$ are the loss weights; other symbols are consistent with Equations (9) and (11).
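A compact sketch of the consistency term of Equations (10)-(11) and its combination with the MIL loss in Equation (12) is shown below; the bag scores are assumed to be the (K, C) outputs of the per-level MIL heads, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def noise_consistency_loss(bag_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise (1 - cosine similarity) between same-receptive-field bag scores,
    pushed towards 0 with a Smooth L1 penalty (Eqs. 10-11); bag_scores: (K, C)."""
    sim = F.cosine_similarity(bag_scores[:, None, :], bag_scores[None, :, :], dim=-1)
    dist = 1.0 - sim                                           # Eq. (10), range [0, 2]
    return F.smooth_l1_loss(dist, torch.zeros_like(dist), reduction='sum')

def ncc_loss(mil_losses, consistency_losses, a1=0.6, a2=0.4):
    """Eq. (12): weighted sum of the total MIL loss and the total consistency loss."""
    return a1 * sum(mil_losses) + a2 * sum(consistency_losses)
```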
Two core indicators guide the selection of optimal noise point bags: (1) Receptive Field Matching (RFM), which measures the adaptability between the receptive-field size of the feature map and the actual object size, defined as $RFM = 1 - \log_2(RF_{size}/GT_{size})$, where $RF_{size}$ is the receptive-field size of the feature map and $GT_{size}$ is the actual area of the ground-truth object; (2) Object Coverage (OC), the proportion of noise points in the bag that fall within the ground-truth object region, defined as $OC = N_{in}/N_{total}$, where $N_{in}$ is the number of noise points inside the object and $N_{total}$ is the total number of points in the bag. The optimal noise point bag is selected based on the weighted fusion score $S = 0.6 \times RFM + 0.4 \times OC$, with weights determined via cross-validation to balance scale adaptability and region completeness.
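As a worked illustration of this fusion score, the helper below computes $S$ from the two indicators. It is a sketch only; the ground-truth size used by RFM is available only when analyzing or supervising bag selection, whereas at inference the classification scores of Section 3.2 drive the selection.

```python
import math

def bag_selection_score(rf_size: float, gt_size: float,
                        n_in: int, n_total: int,
                        w_rfm: float = 0.6, w_oc: float = 0.4) -> float:
    """Weighted fusion S = 0.6 * RFM + 0.4 * OC used to rank noise point bags."""
    rfm = 1.0 - math.log2(rf_size / gt_size)   # receptive-field matching
    oc = n_in / n_total                        # object coverage
    return w_rfm * rfm + w_oc * oc

# Example: a 32x32 receptive field on a 30x30-pixel object with 7 of 9 points inside
# gives S = 0.6 * (1 - log2(1024 / 900)) + 0.4 * (7 / 9) ~ 0.80.
```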

3.5. Semantic Key Point Aggregator

As individual noise sampling points in the noise point bags have limited receptive fields and cannot provide holistic information about the object, and to avoid directly estimating the object orientation or finely delineating the object contour, this paper designs a Semantic Key Point Aggregator (SKPA), inspired by RepPoints [23], to aggregate the sampling points in the noise point bags into a bag of inspiration sampling points that indicate semantically important local areas of the object. However, since point annotations provide no size information, we cannot use attention mechanisms to offset the sampling points. Therefore, inspired by fine-grained image classification methods [24,25], we utilize graph convolution [26] to exchange information among the noise sampling points and aggregate them into a bag of inspiration sampling points. Finally, this paper uses the Jarvis March algorithm [27,28] to compute the convex hull of the bag of inspiration sampling points, from which the corresponding pseudo-RBB is obtained.
The Semantic Key Point Aggregator encapsulates each noise sampling point in the noise point bag $\{\hat{B}_u\}_i^j \in \mathbb{R}^{N \times D}$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, K\}$, as a graph node. It employs graph convolutional networks to aggregate them based on the feature similarity between the noise sampling points. The pairwise adjacency relationship between different nodes is defined as:

$$A_{m,n} = \frac{\varphi(p_m) \cdot \varphi(p_n)}{\|\varphi(p_m)\| \cdot \|\varphi(p_n)\|}, \quad p \in \{\hat{B}_u\}_i^j,$$ (13)

where $A_{m,n}$ is the adjacency weight between the $m$-th and $n$-th noise points; $\varphi(\cdot)$ is a $1 \times 1$ convolution layer for feature projection; $p_m$ and $p_n$ are noise points in $\{\hat{B}_u\}_i^j$; the remaining terms denote the dot product and L2 norms of the projected features.
Subsequently, through this densely connected GCN operation, each node can be updated via similarity-based aggregation:

$$H_1 = \mathrm{ReLU}\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \tilde{T} \tilde{W}_g\big),$$ (14)

where $\tilde{W}_g \in \mathbb{R}^{D \times d_h}$ are the learnable graph weights with hidden dimension $d_h$, $\tilde{D}_{m,m} = \sum_n \tilde{A}_{m,n}$ is the diagonal degree matrix used for normalization, and $\tilde{T}$ is the matrix form of the noise point bag. Thus, the feature of each node is updated by this message-passing operation. Another purpose of the graph embedding operation is to divide the nodes of the graph into multiple groups. This paper learns the grouping rule by introducing a further graph convolution layer $H_2$, which aggregates the noise point bag $\{\hat{B}_u\}_i^j \in \mathbb{R}^{N \times D}$ into $\{G_u\}_i^j \in \mathbb{R}^{E \times D}$:
$$H_2 = \mathrm{ReLU}\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \tilde{H}_1 \tilde{W}_{emb}\big),$$ (15)

where $\tilde{W}_{emb} \in \mathbb{R}^{d_h \times E}$ are the graph weights. Thus, $H_2 \in \mathbb{R}^{N \times E}$ defines a mapping that aggregates the original $N$ noise sampling points in the noise point bag into $E$ inspiration sampling points. This paper uses $H_2$ to map the coordinates and features of the $N$ noise sampling points into the coordinates and features of the $E$ inspiration sampling points. A SoftMax operation is applied along the group dimension of $H_2$, so the coordinates and features of each inspiration sampling point can be viewed as a probabilistic weighted combination of the coordinates and features of the original $N$ noise sampling points. After obtaining $\{G_u\}_i^K$, this paper concatenates the features of the inspiration sampling points in the bag $\{G_u\}_i^K$, feeds them into the fully connected layer $f_p$, applies the SoftMax activation function to obtain the corresponding classification scores $\{S_{G_u}\}_i^K$, and calculates the loss against the annotated category $c_u$, as shown in the following equations:
$$\{S_{G_u}\}_i^K = f_p\big(\mathrm{Concat}(\{G_u\}_i^K)\big),$$ (16)

$$L_p = \frac{1}{R \times K} \sum_{u=1}^{R} \sum_{j=1}^{K} FL(\{S_{G_u}\}_i^j, c_u),$$ (17)

where $\{S_{G_u}\}_i^K \in \mathbb{R}^{K \times C}$ are the classification scores of the semantic key point bags; $f_p(\cdot)$ is the fully connected layer; $\mathrm{Concat}(\cdot)$ merges the features of the $E$ key points; $\{G_u\}_i^K$ is the set of semantic key point bags; $L_p$ is the point loss encouraging the key points to align with semantically important parts; other symbols are consistent with Equations (8) and (16).
Subsequently, this paper utilizes RROI Align [27] to extract features of the convex hull regions corresponding to the inspiration sampling point bags and feeds them into the fully connected layer $f_r$. The SoftMax activation function is applied to obtain the corresponding classification scores $\{R_{G_u}\}_i^K$, and the classification loss is calculated as shown in the following equations:

$$\{R_{G_u}\}_i^K = f_r\big(\mathrm{ConvexHull}(\{G_u\}_i^K)\big),$$ (18)

$$L_r = \frac{1}{R \times K} \sum_{u=1}^{R} \sum_{j=1}^{K} FL(\{R_{G_u}\}_i^j, c_u).$$ (19)
The loss of the Semantic Key Point Aggregator, $L_{SKPA}$, is the sum of $L_p$ and $L_r$. The former encourages the module to find inspiration sampling points aligned with the most class-discriminative parts of the object, while the latter ensures that the region enclosed by these points also has good classification ability, so that the object is covered more comprehensively.
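The graph-convolutional aggregation of Equations (13)-(15) can be sketched as below. The layer sizes ($d_h$, $E$) are illustrative, and the SoftMax is placed over the point dimension so that each key point is an explicit probabilistic weighted combination of the $N$ noise points, matching the description above; the released code may organize this differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticKeyPointAggregator(nn.Module):
    """Sketch of the SKPA: similarity graph over the N points of one bag, two
    normalized graph-convolution layers (Eqs. 14-15), and a grouping map H2 that
    merges the N noise points into E semantic key points."""
    def __init__(self, D: int, d_h: int = 64, E: int = 6):
        super().__init__()
        self.phi = nn.Linear(D, D)        # 1x1 projection used by the adjacency (Eq. 13)
        self.W_g = nn.Linear(D, d_h)      # graph weights of H1 (Eq. 14)
        self.W_emb = nn.Linear(d_h, E)    # grouping weights of H2 (Eq. 15)

    def forward(self, coords: torch.Tensor, feats: torch.Tensor):
        # coords: (N, 2) point coordinates; feats: (N, D) point features.
        p = self.phi(feats)
        A = F.cosine_similarity(p[:, None, :], p[None, :, :], dim=-1)     # (N, N)
        d_inv_sqrt = torch.diag(A.sum(dim=1).clamp_min(1e-6).rsqrt())
        A_norm = d_inv_sqrt @ A @ d_inv_sqrt                              # D^-1/2 A D^-1/2
        H1 = F.relu(self.W_g(A_norm @ feats))                             # (N, d_h)
        H2 = F.softmax(self.W_emb(A_norm @ H1), dim=0)                    # (N, E)
        key_coords = H2.t() @ coords      # (E, 2): weighted combinations of the N points
        key_feats = H2.t() @ feats        # (E, D)
        return key_coords, key_feats
```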

3.6. Theoretical Analysis

3.6.1. Mapping Relationship Between Diffusion Process and Pseudo-Box Generation

We define the probability distribution of the noise point set at time step $t$ as $P(x_t)$, which follows the Gaussian distribution $\mathcal{N}\big(\sqrt{\hat{\alpha}_t}\, x_0, (1-\hat{\alpha}_t) I\big)$. As the number of diffusion steps $T$ increases, when $t \to T$, $\hat{\alpha}_t \to 0$ and $P(x_t) \to \mathcal{N}(0, I)$, which is the initial random noise distribution. During the denoising process, the model learns to map the noise point set $x_t$ back to the original point annotation $x_0$. However, in this paper we do not need to completely recover $x_0$; instead, we generate a set of inspiration points whose convex hull can approximate the real RBB.
We assume that the real RBB of the object is $B = (c_x, c_y, w, h, a)$ and that the convex hull of the inspiration point set is $CH(A)$. We need to show that the IoU between $CH(A)$ and $B$ has a lower bound. According to the Jarvis March algorithm, the convex hull $CH(A)$ is the smallest convex polygon containing all inspiration points. Owing to the constraints of the Noise Cross-Constraint module and the Semantic Key Point Aggregator, any inspiration point $a \in A$ satisfies $a \in B$ with high probability. Therefore, $CH(A) \subseteq B$ with high probability, and the area of $CH(A)$ is at least that of the smallest convex polygon containing the semantically important parts of the object. Let the area of the semantically important parts be $S_{imp}$ and the area of $B$ be $S_B = w h$; then the IoU between $CH(A)$ and $B$ is at least $S_{imp}/S_B$. In practice, $S_{imp}$ accounts for a substantial proportion of $S_B$ (usually more than 50% for most objects), so the IoU has a lower bound of about 0.5.
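For completeness, a minimal gift-wrapping (Jarvis March) implementation is sketched below; the released code may differ, and fitting the final pseudo-RBB additionally requires enclosing this hull with a minimum-area rotated rectangle, which is omitted here.

```python
def jarvis_march(points):
    """Convex hull (gift wrapping) of 2-D points, returned in counter-clockwise
    order; assumes the points are in general position."""
    def cross(o, a, b):
        # > 0: b lies to the left of the directed line o -> a.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    pts = sorted(set(map(tuple, points)))
    if len(pts) < 3:
        return pts
    hull, start = [], pts[0]                 # lowest-leftmost point is on the hull
    current = start
    while True:
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:
            if p != current and cross(current, candidate, p) < 0:
                candidate = p                # p lies to the right: wrap further
        current = candidate
        if current == start:
            break
    return hull

# Example: hull of a small noisy point cloud.
print(jarvis_march([(0, 0), (4, 1), (5, 4), (1, 3), (2, 2), (3, 2)]))
# -> [(0, 0), (4, 1), (5, 4), (1, 3)]
```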

3.6.2. Convergence Analysis of Noise Cross-Constraint Module

We analyze the convergence of the loss function of the Noise Cross-Constraint module, $L_{CC} = L_{mil} + \lambda_{nc} L_{nc}$. We assume that the learning rate is $\eta$ and that the gradient of the loss function with respect to the model parameters $\theta$ is $\nabla_\theta L_{CC}$. Under gradient descent, the parameter update rule is $\theta_{t+1} = \theta_t - \eta \nabla_\theta L_{CC}$.
We first analyze the Lipschitz constant of $L_{mil}$. The focal loss function $FL$ is convex and its derivative is bounded. The MIL loss $L_{mil}$ is the average of the focal losses of multiple noise point bags, so it is also convex and its Lipschitz constant $L_{mil}$ is bounded. Similarly, the consistency loss $L_{nc}$ is a smooth function and its Lipschitz constant $L_{nc}$ is also bounded. Therefore, the Lipschitz constant of $L_{CC}$ is $L = L_{mil} + \lambda_{nc} L_{nc}$, which is bounded.
According to the convergence theory of gradient descent, when the learning rate satisfies $\eta < \frac{2}{L}$, the loss function $L_{CC}$ decreases monotonically and the parameters $\theta$ converge to the optimal solution. In this paper, we set the learning rate to 0.001 through cross-validation, which satisfies $\eta < \frac{2}{L}$, ensuring the convergence of the Noise Cross-Constraint module.

4. Experiment

4.1. Datasets and Implementation Details

(1)
Datasets: DOTA-v1.0 [29] is presently among the most widely employed datasets for oriented object detection in aerial images. It comprises 2806 images, 188,282 instances annotated with RBoxes, and is classified into 15 categories. For training and testing, we follow a standard protocol by cropping images into 1024 × 1024 patches with a stride of 824. DIOR-R [30] is an aerial image dataset annotated by RBoxes based on its horizontal annotation version DIOR. The dataset consists of 23,463 images, 190,288 instances, and is classified into 20 categories.
(2)
Single-Point Annotation: To accurately simulate manually annotated points, this paper does not directly use the center point of the RBB label as the point annotation. Instead, it selects a random point within a range of 10% of the RBB's width and height around the center point as the single-point annotation, thereby reproducing the deviations of manual annotation (a short simulation sketch is given after this list). The impact of the deviation range is discussed in Section 5.
(3)
Training Details: The algorithms employed in the experiments are taken from the open-source library MMRotate [31], which is based on PyTorch, and this paper follows its default settings. All experiments were conducted on an NVIDIA RTX 4090 GPU with 24 GB of memory. Both DP2PNet, trained with single-point annotations, and the fully supervised rotated-box detectors trained with the pseudo-RBBs it generates use a "1×" schedule of 12 epochs. For the compared algorithms, we followed their base settings.
(4)
Evaluation Metric: Mean Average Precision (mAP) is used as the main metric to compare our method in this paper with existing methods. To evaluate the quality of pseudo-RBBs generated by DP2PNet from point annotations, this paper reports the mean Intersection over Union (mIoU) between manually annotated RBBs (GT) and pseudo-RBBs.
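A sketch of how such deviated point annotations could be generated from RBB labels (item (2) above) is shown below; the helper name and the omission of rotating the offset by the box angle are our simplifications.

```python
import random

def simulate_point_annotation(cx, cy, w, h, dev=0.10, rng=random):
    """Sample a simulated manual point annotation within +/- dev * (w, h) of the
    RBB centre, mirroring the 10% deviation range described in Section 4.1; the
    offset is applied in the axis-aligned frame for simplicity."""
    dx = rng.uniform(-dev, dev) * w
    dy = rng.uniform(-dev, dev) * h
    return cx + dx, cy + dy

# Example: a 40 x 20 box centred at (100, 50) yields a point within +/-4 px in x
# and +/-2 px in y of the centre.
print(simulate_point_annotation(100.0, 50.0, 40.0, 20.0))
```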
DP2PNet adopts single-point supervision, whose annotation cost is far below that of HBox and RBox supervision (HBox and RBox annotation are reported to cost 36.5% and 104.8% more than point annotation, respectively). Although its mAP50 is slightly lower than that of HBox-supervised methods such as H2RBox-v2, its annotation cost is only about one tenth of theirs, achieving excellent cost-effectiveness.

4.2. Performance Comparisons

4.2.1. Comparisons Results

To verify the performance of DP2PNet in single-point supervised oriented detection, we compared it with existing methods on the DOTA-v1.0 and DIOR-R datasets. Visualization of detection results on the DOTA-v1.0 using DP2PNet (FCOS) is shown in Figure 3 and Figure 4. The visualized detection results demonstrate the effectiveness of our method. In addition, the results on the DOTA-v1.0 dataset and the DIOR-R dataset are shown in Table 3 and Table 4.
(1)
Results on DOTA-v1.0: As shown in Table 3, DP2PNet achieves 53.82% mAP50 when training Rotated FCOS, surpassing point converters such as Point2RBox and PointOBB, which rely on manually designed hyperparameters. Additionally, our method exhibits competitive performance on the 7-mAP50 metric compared with P2RBox, which uses SAM as prior information.
(2)
Results on DIOR-R: As shown in Table 4, DP2PNet achieves 52.14% and 53.61% mAP50 when training Rotated FCOS and Oriented R-CNN [32], respectively. DP2PNet surpasses the two-stage alternative, Point-to-HBox-to-RBox (P2BNet + H2RBox-v2), on the mAP50 metric. Moreover, DP2PNet achieves a 21% improvement on the 8-mAP50 metric compared with the point converter PointOBB, and it also demonstrates competitive performance relative to SAM-based and fully supervised methods.
To verify the generalization ability across different scenarios, we conduct experiments on the ICDAR2015 [33] dataset, a benchmark for scene text detection that contains 1500 images with over 5000 text instances annotated with rotated bounding boxes. Experimental results show that DP2PNet with Rotated FCOS achieves 51.2% mAP50, outperforming PointOBB (47.8%) and P2RBox (49.5%). The strong cross-scenario generalization of DP2PNet stems from two key designs: the multi-scale consistent noise generator adapts to the scale differences between aerial objects and scene text, while the Noise Cross-Constraint module enhances the model's ability to perceive diverse object shapes in non-remote-sensing scenarios. Unlike methods relying on scenario-specific priors, our noise-based prior ensures broader applicability.

4.2.2. Analysis of Performance Difference

On the DOTA dataset, the 7-mAP50 of our method is slightly lower than that of P2RBox (64.84% vs. 67.25%). The main reason is that P2RBox uses SAM to generate high-precision mask priors, which can more accurately locate object edges, especially for small-scale targets such as small vehicles: the mIoU of our method on the small-vehicle category is 58.7%, versus 62.3% for P2RBox. However, our method does not rely on external models such as SAM and offers better generalization and efficiency.
On the DIOR-R dataset, our method achieves competitive performance with SAM (53.61% vs. 53.73% in mAP50). This is because the DIOR-R dataset has a large number of large-scale targets, such as airplanes and ships, and our multi-scale consistent noise generator can better adapt to the scale characteristics of these targets, so as to achieve better performance.

4.2.3. Comparison of Method Complexity

To further evaluate the comprehensive performance of our method, we compare the complexity and efficiency of our method with existing methods, as shown in Table 5.
It can be seen from Table 5 that our method has fewer parameters and faster inference speed than P2RBox and PointOBB. At the same time, our method has lower prior dependence and can maintain higher performance under the condition of less training data, which shows that our method has better efficiency and generalization.

4.3. Limitations and Future Directions

Despite promising results, DP2PNet has limitations: it exhibits relatively lower performance on ultra-small objects (pixel-level) due to challenges in noise point coverage matching. For future work, we plan to: (1) Extend to semi-supervised learning scenarios by combining a small number of fully annotated RBBs to further improve accuracy; (2) Adapt to other weak supervision annotation types, such as line annotations and region annotations, by modifying the noise generation and aggregation strategies; (3) Explore integration with transformer-based feature extractors to enhance semantic feature learning. The noise-based prior design enables DP2PNet to generalize to domains beyond computer vision, such as medical image analysis.

5. Ablation Study

In this section, a series of ablation experiments is designed to validate the key components of DP2PNet.

5.1. The Effect of Multi-Scale Map

Table 6 investigates the effect of the first key design, multi-scale mapping, during the diffusion and denoising process in this paper. As shown in Table 6, mapping noisy points to a single receptive field feature map alone leads to varying degrees of performance degradation. This is because the noisy points in a single receptive field cannot accurately indicate the crucial semantic parts of objects of different sizes. For example, the P2 feature map has a small receptive field, which is suitable for small-scale targets, but the mAP50 for large-scale targets, such as a harbor, is only 42.3%. The P4 feature map has a large receptive field, which is suitable for large-scale targets, but the mAP50 for small-scale targets, such as a small vehicle, is only 40.1%. When using multi-scale feature maps (P2 + P3 + P4), the model can adapt to targets of different scales, and the mAP50 for both small and large-scale targets is improved to more than 50%.

5.2. The Effect of Single-Scale Consistent

Table 7 investigates the effect of the second key design in this paper, single-scale consistency. It can be observed that as the number of bags of noise points with the same receptive field, denoted as K, increases, the model’s performance first rises and then saturates at K = 3 . This is because three bags of noise points with the same receptive field are sufficient to cover objects of different shapes well. When K = 1 , the noise point bag may not cover the object completely, resulting in low mAP50 (48.92%). When K = 2 , the coverage of the object is improved, and the mAP50 increases to 55.67%. When K = 3 , the coverage of the object is further improved, and the mAP50 reaches 58.98%. When K > 3 , increasing K further will only add to the computational burden of the model, but will not significantly improve the performance, and the mAP50 even decreases slightly (58.24% when K = 4 ).

5.3. The Effect of Progressive Refinement

Table 8 examines the effect of the third key design in this paper, progressive refinement. As shown in Table 8, the model in this paper can conveniently adjust the refinement stage by modifying the sampling steps during the denoising process. It is observed that the effectiveness of the model gradually improves as the refinement stages increase, reaching a saturation point when the refinement stage number is 5. When the number of refinement stages is 1, the model has not yet fully denoised the noise points, and the pseudo-RBBs have large deviations, resulting in low mAP50 (37.56%). As the number of refinement stages increases to 3, the model further denoises the noise points, and the mAP50 increases to 40.65%. When the number of refinement stages reaches 5, the model has basically completed the denoising process, and the mAP50 reaches 43.15%. When the number of refinement stages exceeds 5, excessive denoising will lead to the loss of useful information, and the mAP50 decreases slightly (42.87% when S = 7 ).

5.4. The Effect of Point Padding Strategy

As mentioned in Section 3.3, we need to pad additional points onto the original single-point annotation of each object so that every point bag contains the same number of points N. We study different padding strategies in Table 9. The "Uniform" strategy (padding with uniformly sampled random points) achieves the best performance, with an mAP50 of 43.12%. The "Repeat" strategy (duplicating existing points) performs poorly because it cannot expand the coverage of the point bag, resulting in incomplete object coverage. The "Gaussian" strategy (sampling padding points from a Gaussian distribution) performs slightly worse than the "Uniform" strategy because the Gaussian samples concentrate near the original point, limiting the coverage of the point bag.

5.5. The Effect of Point Annotation Deviation Range

Table 10 shows the impact of different point annotation deviation ranges on the generated pseudo-labels. Because DP2PNet is built on a diffusion model, it is not sensitive to the noise introduced during labeling: when the deviation range increases from 10% to 50%, the mAP50 decreases by only 0.04% (from 43.15% to 43.11%) and the mIoU by only 0.12% (from 58.98% to 58.86%). This is because the diffusion model has strong noise robustness, and the multi-scale consistent noise generator can adapt to the deviation of point annotations by adjusting the noise distribution.

5.6. The Effect of Noise Level

To verify the noise robustness of DP2PNet, we designed an experiment on the effect of noise level. We set 5 noise levels (noise standard deviation σ = 0.1 , 0.3 , 0.5 , 0.7 , 0.9 ) and compared the performance of our method with PointOBB and P2RBox. The results are shown in Table 11.
It can be seen from Table 11 that with the increase of noise level, the performance of all methods decreases, but the performance attenuation rate of our method is significantly lower than that of PointOBB and P2RBox. When the noise standard deviation reaches 0.9, the performance attenuation rate of our method is only 0.90%, while that of PointOBB and P2RBox is 21.83% and 15.37%, respectively. This shows that our method has strong noise robustness, which is due to the fact that the diffusion model itself is a noise modeling process, and the multi-scale consistent noise generator can adapt to different noise levels by adjusting the noise distribution.

5.7. The Effect of Target Type

To verify the adaptability of DP2PNet to different types of targets, we selected three types of typical targets in the DOTA dataset (small-scale: Small Vehicle, large-scale: Harbor, irregular shape: Baseball Diamond) and compared the performance of our method with existing methods. The results are shown in Table 12.
It can be seen from Table 12 that our method has good adaptability to targets of different types. For small-scale targets (Small Vehicle), although the mAP50 of our method is lower than that of P2RBox, it is higher than that of PointOBB. For large-scale targets (harbor), the mAP50 of our method is higher than that of PointOBB and close to that of P2RBox. For irregular shape targets (Baseball Diamond), the mAP50 of our method is higher than that of PointOBB and close to that of P2RBox. This shows that our method can adapt to the characteristics of different types of targets through the multi-scale consistent noise generator and the Semantic Key Point Aggregator.

5.8. Cross-Dataset Generalization Experiment

To verify the cross-dataset generalization ability of DP2PNet, we trained the model on the DOTA dataset and tested it directly on the DIOR-R dataset without fine-tuning. We compared the performance of our method with PointOBB and P2RBox. The results are shown in Table 13.
It can be seen from Table 13 that our method has better cross-dataset generalization ability. The performance retention rate of our method is 90.0%, which is higher than that of PointOBB (80.3%) and P2RBox (78.5%). This is because our method uses noise as the only prior, which avoids the over-reliance on dataset-specific prior information (such as manually designed scales or external model masks) in existing methods, thus enhancing the generalization across different datasets.

6. Conclusions

In this paper, we propose DP2PNet, which is the first diffusion model-based framework for single-point supervised oriented object detection, addressing the key limitations of existing methods. Three core innovations drive its performance: The multi-scale consistent noise generator replaces manual or external model priors with Gaussian noise, eliminating domain-specific constraints; The Noise Cross-Constraint module enables multi-scale object perception through receptive field matching and object coverage fusion; The Semantic Key Point Aggregator generates pseudo-RBBs via convex hulls of graph-convolution-aggregated semantic points, avoiding direct angle prediction. Experimental results show that DP2PNet achieves 53.82% mAP50 on DOTA-v1.0 and 53.61% on DIOR-R, outperforming PointOBB by over 13% and demonstrating competitive performance with P2RBox and SAM-based methods. Compared to state-of-the-art methods, DP2PNet has fewer parameters and faster inference speed (22 FPS vs. 18 FPS for PointOBB). Future work will focus on three directions: integrating transformer architectures to enhance feature extraction, extending to semi-supervised and few-shot learning scenarios, and adapting to diverse weak supervision annotation types.

Author Contributions

Conceptualization, P.L.; Software, P.L.; Validation, P.L.; Formal Analysis, L.Z.; Investigation, T.Q.; Writing—Original Draft, P.L.; Writing—Review and Editing, P.L. and L.Z.; Visualization, L.Z.; Supervision, T.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sikic, F.; Kalafatic, Z.; Subasic, M.; Loncaric, S. Enhanced Out-of-Stock Detection in Retail Shelf Images Based on Deep Learning. Sensors 2024, 24, 693.
2. Hu, Z.; Zeng, X.; Xie, K.; Wen, C.; He, J.; Zhang, W. Efficient Defect Detection of Rotating Goods under the Background of Intelligent Retail. Sensors 2024, 24, 467.
3. Liao, M.; Shi, B.; Bai, X. Textboxes++: A single-shot oriented scene text detector. IEEE Trans. Image Process. 2018, 27, 3676–3690.
4. Zhukov, A.; Rivero, A.; Benois-Pineau, J.; Zemmari, A.; Mosbah, M. A Hybrid System for Defect Detection on Rail Lines through the Fusion of Object and Context Information. Sensors 2024, 24, 1171.
5. Zhang, T.; Zhang, X.; Liu, C.; Shi, J.; Wei, S.; Ahmad, I.; Zhan, X.; Zhou, Y.; Pan, D.; Li, J.; et al. Balance learning for ship detection from synthetic aperture radar remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 182, 190–207.
6. Shan, D.; Yang, Z.; Wang, X.; Meng, X.; Zhang, G. An Aerial Image Detection Algorithm Based on Improved YOLOv5. Sensors 2024, 24, 2619.
7. Liu, W.; Zhang, T.; Huang, S.; Li, K. A hybrid optimization framework for UAV reconnaissance mission planning. Comput. Ind. Eng. 2022, 173, 108653.
8. Yang, X.; Zhang, G.; Li, W.; Wang, X.; Zhou, Y.; Yan, J. H2rbox: Horizontal box annotation is all you need for oriented object detection. arXiv 2022, arXiv:2210.06742.
9. Yu, Y.; Yang, X.; Li, Q.; Zhou, Y.; Zhang, G.; Yan, J.; Da, F. H2rbox-v2: Boosting hbox-supervised oriented object detection via symmetric learning. arXiv 2023, arXiv:2304.04403.
10. Chen, L.; Yang, T.; Zhang, X.; Zhang, W.; Sun, J. Points as queries: Weakly semi-supervised object detection by points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8823–8832.
11. Chen, P.; Yu, X.; Han, X.; Hassan, N.; Wang, K.; Li, J.; Zhao, J.; Shi, H.; Han, Z.; Ye, Q. Point-to-box network for accurate object detection via single point supervision. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 51–67.
12. Cao, G.; Yu, X.; Yu, W.; Han, X.; Yang, X.; Li, G.; Jiao, J.; Han, Z. P2RBox: A Single Point is All You Need for Oriented Object Detection. arXiv 2023, arXiv:2311.13128.
13. Luo, J.; Yang, X.; Yu, Y.; Li, Q.; Yan, J.; Li, Y. PointOBB: Learning Oriented Object Detection via Single Point Supervision. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16730–16740.
14. Yi, Y.; Yang, X.; Li, Q.; Da, F.; Yan, J.; Dai, J.; Qiao, Y. Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision. arXiv 2023, arXiv:2311.14758.
15. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026.
16. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
17. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502.
18. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
19. Chen, S.; Sun, P.; Song, Y.; Luo, P. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19830–19843.
20. Xue, H.; Ren, J.; Chen, W.; Zhang, G.; Fang, Y.; Gu, G.; Xu, H.; Lu, C. Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation. arXiv 2025, arXiv:2503.02881.
21. Wang, T.; Dou, Z.; Bao, C.; Shi, Z. Diffusion Mechanism in Residual Neural Network: Theory and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 667–680.
22. Bilen, H.; Vedaldi, A. Weakly Supervised Deep Detection Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 2846–2854.
23. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666.
24. Chou, P.Y.; Lin, C.H.; Kao, W.C. A novel plug-in module for fine-grained visual classification. arXiv 2022, arXiv:2202.03822.
25. Yan, B.; Wang, S.; Zhu, E.; Liu, X.; Chen, W. Group-Attention Transformer for Fine-Grained Image Recognition. In Proceedings of the International Conference on Artificial Intelligence and Security; Springer: Cham, Switzerland, 2022; pp. 40–54.
  26. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
  27. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [Google Scholar] [CrossRef]
  28. Jarvis, R.A. On the identification of the convex hull of a finite set of points in the plane. Inf. Process. Lett. 1973, 2, 18–21. [Google Scholar] [CrossRef]
  29. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  30. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  31. Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. Mmrotate: A rotated object detection benchmark using pytorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 7331–7334. [Google Scholar] [CrossRef]
  32. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  33. Peyrard, C.; Baccouche, M.; Mamalet, F.; Garcia, C. ICDAR2015 competition on Text Image Super-Resolution. In Proceedings of the 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Nancy, France, 23–26 August 2015; pp. 1201–1205. [Google Scholar] [CrossRef]
Figure 1. The Figure sequentially illustrates the oriented object detection task based on point supervision, the primary structure of the existing point converter with its drawbacks (highlighted in red font), and the newly designed point converter in this paper featuring targeted enhancements (highlighted in green font). The critical element of the task pipeline is the design of the point converter that transforms point annotations into pseudo-Rotated Bounding Boxes.
Figure 2. The Figure shows the overall structure of the DP2PNet described in this paper. We use the P2, P3, and P4 feature maps for the subsequent processes. The multi-scale consistent noise generator distributes K groups of noise point bags across three feature maps to complete the diffusion process.
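As a hedged illustration of the noise distribution described in the Figure 2 caption, the sketch below places K Gaussian noise point bags around one point annotation for each of the P2/P3/P4 feature levels. The per-level scaling by FPN stride, the bag size, and sigma are assumptions for illustration, not the paper's exact parameterization.

```python
# A minimal sketch, assuming K bags of Gaussian noise points are placed around a
# single point annotation, scaled consistently across the P2/P3/P4 strides.
import numpy as np

def multi_scale_noise_bags(point_xy, K=3, bag_size=9, sigma=0.5, strides=(4, 8, 16), seed=0):
    """Return {stride: (K, bag_size, 2)} noise point bags around one annotated point.

    sigma is given in units of the level stride, so the same normalized offsets
    stay consistent across the three feature maps (increasing receptive fields).
    """
    rng = np.random.default_rng(seed)
    # One shared set of normalized offsets keeps the K bags consistent across scales.
    offsets = rng.normal(0.0, sigma, size=(K, bag_size, 2))
    return {s: np.asarray(point_xy) + offsets * s for s in strides}

bags = multi_scale_noise_bags((120.0, 96.0))
for stride, pts in bags.items():
    print(f"stride {stride}: {pts.shape[0]} bags of {pts.shape[1]} noise points")
```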
Figure 3. Examples of detection results on the DOTA-v1.0 dataset using DP2PNet (FCOS).
Figure 4. Visual comparison of Ours (DP2PNet), P2BNet, and PointOBB.
Table 1. Comparison of innovative points between this method and existing single-point supervised methods.
Method | Prior Type | Optimization Stage Flexibility | Model Framework
PointOBB | Manual design | Fixed (5 stages) | MIL + fixed-stage refinement
P2RBox | External model (SAM mask) | Fixed (3 stages) | SAM + inspector module
Ours (DP2PNet) | Noise prior (Gaussian noise) | Dynamic (1–7 stages) | Diffusion model + MIL + graph convolution
Table 2. Comparison with prior work in single-point supervised oriented object detection.
Method | Prior Type | Optimization Flexibility | Core Module | Angle Prediction Method | mAP50 (DOTA)
PointOBB | manual design (Sc & Ra & An) | fixed (5 stages) | MIL + fixed-stage refinement | direct prediction | 30.08%
P2RBox | external model (SAM mask) | fixed (3 stages) | SAM + inspector module | mask-based fitting | 58.40%
DiffusionDet | Gaussian noise | dynamic | denoising diffusion for bounding boxes | coordinate optimization | -
Ours (DP2PNet) | Gaussian noise | dynamic (1–7 stages) | Diffusion + MIL + graph convolution | convex hull of semantic points | 53.82%
Note: Sc = scale hyperparameter, Ra = aspect ratio hyperparameter, An = angle hyperparameter; “-” indicates the method is not designed for single-point supervised oriented object detection.
Table 3. Accuracy on the DOTA-v1.0 testing set. Plane (PL), Baseball Diamond (BD), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (SH), and Harbor (HA). R-Form denotes the result format without any post-processing operations, such as the minimum rectangle operation, while P-Form represents the prior information necessary for the method. Sc signifies the scale hyperparameter, Ra refers to the aspect ratio hyperparameter, An indicates the angle hyperparameter, and the symbol & represents the combination of these hyperparameters.
Method | R-Form | P-Form | PL | BD | GTF | SV | LV | SH | HA | 7-mAP50 | mAP50 | Annotation Cost Ratio
RBox-supervised:
Rotated RetinaNet | RBox | - | 88.7 | 77.6 | 58.2 | 74.6 | 71.6 | 79.1 | 62.6 | 73.19 | 68.69 | 100%
Rotated FCOS | RBox | - | 88.4 | 76.8 | 59.2 | 79.2 | 79.0 | 86.9 | 69.3 | 76.96 | 71.28 | 100%
HBox-supervised:
H2RBox | RBox | An | 88.5 | 73.5 | 56.9 | 77.5 | 65.4 | 77.9 | 52.4 | 70.29 | 67.21 | 63.5%
H2RBox-v2 | RBox | An | 89.0 | 74.4 | 60.5 | 79.8 | 75.3 | 86.9 | 65.2 | 75.88 | 72.52 | 63.5%
Point-supervised:
P2BNet + H2RBox-v2 | HBox | Sc & Ra | 11.0 | 44.8 | 15.4 | 36.8 | 16.7 | 27.8 | 12.6 | 23.58 | 21.87 | 36.5%
SAM (FCOS) | Mask | - | 78.2 | 61.7 | 45.1 | 68.7 | 64.8 | 78.6 | 45.7 | 63.26 | 50.84 | 85%
P2RBox (FCOS) | Mask | SAM | 86.7 | 66.0 | 47.4 | 72.4 | 71.3 | 78.6 | 48.4 | 67.25 | 58.40 | -
Point2RBox | RBox | Sc & Ra & An | 66.4 | 59.5 | 52.6 | 54.1 | 53.9 | 57.3 | 22.9 | 52.38 | 44.90 | 36.5%
PointOBB (FCOS) | RBox | Sc & Ra & An | 26.1 | 65.7 | 59.4 | 65.8 | 34.9 | 29.8 | 21.8 | 43.35 | 30.08 | 36.5%
Ours (FCOS) | Polygon | Noise | 81.3 | 63.2 | 48.4 | 70.6 | 67.1 | 77.2 | 46.1 | 64.84 | 52.37 | 36.5%
Ours (Oriented R-CNN) | Polygon | Noise | 82.6 | 64.1 | 49.8 | 71.2 | 68.9 | 78.1 | 48.4 | 66.16 | 53.82 | 36.5%
Note: Annotation Cost Ratio is cited from PointOBB: Point annotation cost is 36.5% of RBox annotation, HBox annotation is 63.5% of RBox annotation.
Table 4. Accuracy on the DIOR-R testing set. Airplane (APL), Baseball Field (BF), Basketball Court (BC), Ground Track Field (GTF), Harbor (HA), Ship (SH), Tennis Court (TC), and Vehicle (VE), which are representative categories in remote sensing. Annotation Cost Ratio is relative to RBB full annotation (set to 100%); lower values indicate lower annotation cost.
Method | R-Form | P-Form | APL | BF | BC | GTF | HA | SH | TC | VE | 8-mAP50 | mAP50 | Annotation Cost Ratio
RBox-supervised:
Rotated RetinaNet | RBox | - | 58.9 | 73.1 | 81.3 | 32.5 | 32.4 | 75.1 | 81.0 | 44.5 | 64.26 | 54.96 | 100%
Rotated Faster R-CNN | RBox | - | 63.1 | 79.1 | 82.8 | 40.7 | 55.9 | 81.1 | 81.4 | 65.6 | 69.88 | 62.80 | 100%
Rotated FCOS | RBox | - | 61.4 | 74.3 | 81.1 | 32.8 | 48.5 | 80.0 | 63.9 | 42.7 | 66.59 | 59.83 | 100%
Image-supervised:
WSODet | HBox | Sc & Ra | 20.7 | 63.2 | 67.3 | 0.3 | 1.5 | 1.2 | 40.0 | 6.1 | 28.46 | 22.20 | 5%
HBox-supervised:
H2RBox | RBox | An | 68.1 | 75.0 | 85.4 | 34.7 | 44.2 | 79.3 | 81.5 | 40.0 | 67.80 | 57.80 | 63.5%
H2RBox-v2 | RBox | An | 67.2 | 55.6 | 80.8 | 80.3 | 25.3 | 78.8 | 82.5 | 42.0 | 64.06 | 57.64 | 63.5%
Point-supervised:
P2BNet + H2RBox-v2 | HBox | Sc & Ra | 51.6 | 65.2 | 78.3 | 44.9 | 2.3 | 35.9 | 79.0 | 10.3 | 45.94 | 23.61 | 36.5%
SAM (FCOS) | Mask | - | 62.1 | 73.2 | 80.4 | 76.6 | 12.2 | 73.5 | 51.7 | 36.5 | 66.58 | 53.73 | 85%
PointOBB (FCOS) | RBox | Sc & Ra & An | 58.4 | 70.7 | 77.7 | 74.2 | 9.9 | 69.0 | 46.1 | 32.4 | 54.80 | 37.31 | 36.5%
Ours (Rotated FCOS) | Polygon | Noise | 59.8 | 72.3 | 79.3 | 76.9 | 14.1 | 74.2 | 49.3 | 34.8 | 65.81 | 52.14 | 36.5%
Ours (Oriented R-CNN) | Polygon | Noise | 60.1 | 73.4 | 80.2 | 77.9 | 13.4 | 74.1 | 50.3 | 35.6 | 66.43 | 53.61 | 36.5%
Note: Annotation Cost Ratio is cited from PointOBB: Point annotation cost is 36.5% of RBox annotation, HBox annotation is 63.5% of RBox annotation, and image-level annotation is 5% of RBox annotation.
Table 5. Comparison of method complexity and performance.
Method | Number of Parameters (M) | Inference Speed (FPS) | Prior Dependence | Training Data Requirement (10% Annotation Data Performance Retention Rate)
P2RBox [12] (FCOS) | 45.6 | 15 | High (SAM mask) | 72%
PointOBB [13] (FCOS) | 39.8 | 18 | High (Sc & Ra & An) | 68%
Ours (Oriented R-CNN) | 38.2 | 22 | Low (Noise prior) | 85%
Table 6. Ablation studies of the multi-scale map in DOTA-v1.0 by training Rotated FCOS. P2, P3, and P4 represent the feature maps from the Feature Pyramid Network (FPN) output, with increasing receptive fields.
FPN-P2 | FPN-P3 | FPN-P4 | mIOU | 7-mAP50 | mAP50 | mAP50 (Large-Scale Targets)
✓ | - | - | 51.87 | 59.81 | 44.67 | 42.3
- | ✓ | - | 55.21 | 62.15 | 48.72 | 49.8
- | - | ✓ | 49.14 | 58.78 | 43.15 | 45.9
✓ | ✓ | ✓ | 58.98 | 64.84 | 52.37 | 53.8
Table 7. Ablation studies of the Single-Scale Consistent in DOTA-v1.0 by training FCOS. K represents the number of noise point bags with the same receptive field.
K | mIOU | 7-mAP50 | mAP50 | Computational Burden (GFLOPs)
1 | 50.32 | 55.89 | 48.92 | 15.6
2 | 54.74 | 61.45 | 55.67 | 22.3
3 | 58.98 | 64.84 | 58.98 | 28.9
4 | 58.24 | 64.13 | 58.24 | 35.6
Table 8. Ablation studies of the Progressive Refinement Strategy in DOTA-v1.0 by training FCOS. S indicates the number of refinement stages in the denoising process.
S | mIOU | mAP50 | Pseudo-RBB Deviation (Pixel)
1 | 51.57 | 37.56 | 15.3
3 | 54.21 | 40.65 | 8.7
5 | 58.98 | 43.15 | 3.2
7 | 57.76 | 42.87 | 4.1
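Table 8 varies the number of refinement stages S at inference. As a rough illustration of how a diffusion-based converter can change S without retraining, the sketch below subsamples the trained diffusion timesteps in a DDIM-like fashion [17]; the total step count T and the even-spacing rule are assumptions, not necessarily the paper's exact scheduler.

```python
# A hedged sketch: choose S denoising timesteps (descending) out of T trained steps,
# so S can be adjusted at inference time without retraining the model.
import numpy as np

def refinement_schedule(S: int, T: int = 1000):
    """Return S evenly spaced timesteps from T-1 down to 0."""
    return np.linspace(T - 1, 0, num=S).round().astype(int)

for S in (1, 3, 5, 7):  # the settings ablated in Table 8
    print(S, refinement_schedule(S))
```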
Table 9. Ablation studies of the Point Padding Strategy in DOTA-v1.0 by training Rotated FCOS.
Pad | mIOU | mAP50 | Point Bag Coverage Rate (%)
Repeat | 58.21 | 42.74 | 65.3
Gaussian | 57.71 | 42.43 | 78.6
Uniform | 58.98 | 43.12 | 92.1
Table 10. Ablation studies of the Point Annotation Deviation Range in DOTA-v1.0 by training Rotated FCOS. PR represents the point annotation deviation range relative to the width and height of the RBB.
PR | mIOU | mAP50 | Pseudo-RBB Accuracy Rate (%)
10% | 58.98 | 43.15 | 89.2
30% | 58.84 | 43.09 | 88.7
50% | 58.86 | 43.11 | 88.9
Table 11. Ablation studies of the Noise Level in DOTA-v1.0 by training Rotated FCOS.
Noise Std (σ) | Method | mIOU | mAP50 | Performance Attenuation Rate (%)
0.1 | Ours | 59.23 | 43.28 | 0
0.1 | PointOBB | 52.15 | 38.21 | 0
0.1 | P2RBox | 62.31 | 58.67 | 0
0.3 | Ours | 59.01 | 43.21 | 0.16
0.3 | PointOBB | 50.32 | 36.54 | 4.37
0.3 | P2RBox | 60.12 | 56.32 | 4.01
0.5 | Ours | 58.98 | 43.15 | 0.30
0.5 | PointOBB | 48.21 | 34.87 | 8.74
0.5 | P2RBox | 58.45 | 54.15 | 7.70
0.7 | Ours | 58.76 | 43.02 | 0.60
0.7 | PointOBB | 45.12 | 32.15 | 15.86
0.7 | P2RBox | 56.23 | 51.87 | 11.59
0.9 | Ours | 58.54 | 42.89 | 0.90
0.9 | PointOBB | 42.31 | 29.87 | 21.83
0.9 | P2RBox | 54.12 | 49.65 | 15.37
Table 12. Ablation studies of the Target Type in DOTA-v1.0 by training Rotated FCOS.
Target Type | Method | mIOU | mAP50 | Pseudo-RBB Edge Accuracy (Pixel)
Small Vehicle (Small-scale) | Ours | 58.7 | 50.3 | 3.5
Small Vehicle (Small-scale) | PointOBB | 52.3 | 45.6 | 5.2
Small Vehicle (Small-scale) | P2RBox | 62.3 | 58.6 | 2.1
Harbor (Large-scale) | Ours | 64.1 | 53.8 | 4.2
Harbor (Large-scale) | PointOBB | 52.3 | 45.9 | 6.8
Harbor (Large-scale) | P2RBox | 60.2 | 56.3 | 3.8
Baseball Diamond (Irregular shape) | Ours | 61.5 | 52.7 | 3.9
Baseball Diamond (Irregular shape) | PointOBB | 48.7 | 42.3 | 7.1
Baseball Diamond (Irregular shape) | P2RBox | 57.8 | 54.1 | 3.5
Table 13. Cross-dataset generalization experiment results (trained on DOTA, tested on DIOR-R).
Method | mAP50 | 8-mAP50 | Performance Retention Rate (%)
Ours | 48.2 | 60.1 | 90.0
PointOBB | 39.5 | 48.7 | 80.3
P2RBox | 42.1 | 52.3 | 78.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
