Article

Small Object Localization with 90% Annotation Reduction by Positive-Unlabeled Learning

Xiao Zhou, Shihong Wang, Weiguo Hu, Zhaohao Xie, Zheng Pang, Zhuo Jiang and Zhen Cheng *

1 Department of Automation, Tsinghua University, Beijing 100084, China
2 School of Biomedical Engineering, Tsinghua University, Beijing 100084, China
3 National Engineering Research Center for Beijing Biochip Technology, Beijing 102206, China
4 College of Food Science, South China Agricultural University, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Micromachines 2025, 16(12), 1379; https://doi.org/10.3390/mi16121379
Submission received: 23 October 2025 / Revised: 27 November 2025 / Accepted: 30 November 2025 / Published: 3 December 2025
(This article belongs to the Special Issue Microfluidics for Single Cell Detection and Cell Sorting)

Abstract

Small object localization is among the most challenging vision tasks owing to the poor visual appearance and noisy representation that stem from the intrinsic structure of small targets. Recent advances in localizing small objects rely mainly on regression-based counting approaches, which require considerable annotation effort for training. By contrast, human learners can quickly master labeling skills from only a few annotated examples. In this paper, we attempt to emulate this training mechanism and propose a novel positive-unlabeled (PU) learning approach that localizes small objects by learning from partial point annotations. We evaluate our approach on five typical small-object datasets covering single cells, animal/insect populations, and human crowds. Quantitative experimental results show that our approach achieves encouraging localization performance (F1 score > 0.75) even when supervised by less than 10% of the overall point annotations. This approach paves the way for low-annotation-cost single-cell analysis within microfluidic droplets.

1. Introduction

With the rise of deep convolutional neural networks (CNNs), object detection has made remarkable progress over the last decade. Localizing small objects, such as cell nuclei in histopathological images [1,2,3,4,5], cells in microfluidic droplets [6,7,8,9], animal/insect populations [10], and people in crowds [11,12,13], provides essential clues for subsequent tasks such as segmentation, tracking, density estimation, and morphological analysis. Recent studies of small-object localization have mainly focused on predicting a density map of targets [14] and then searching for local density maxima [15,16]. Despite significant advances, obtaining adequate center-point annotations for model training remains difficult for densely packed small objects, e.g., the countless cells in histopathology images or within dense microdroplets [6,7]. There is therefore a great need for an algorithm that can learn from a handful of point-annotation examples and still localize and count small objects.
Methodological overview. Consider a scenario where an experienced individual guides a novice in the task of labeling small objects within a visual image. Typically, the educator only needs to identify a select few exemplary instances for the learner, allowing the learner to perform the annotation task skillfully in accordance with their existing domain-specific knowledge and the examples provided, as illustrated in Figure 1. Inspired by this efficient training mechanism, we expect an algorithm to simulate the learning behavior and then complete the labor-intensive tasks [5,17]. Specifically, the algorithm is supposed to master the skill of localizing small objects by learning from a few point annotations in each training image.
Traditionally, if small object localization is treated as a binary pixel-wise classification problem, each point annotation of a target is viewed as a positive sample while the rest of the image is viewed as negative background [18]. Such pixel-wise classification can hardly function under the extreme imbalance between sparse positive points and the vast negative background. Consequently, most studies instead regress a continuous density map and localize small targets from it [11,15]. To predict the density map, regression-based approaches must first generate a pseudo-density map as the training ground truth from the locations of all instances. For label reduction, however, only some of the small objects in a training image carry point annotations, while the rest of the image remains unknown; an intact and reliable pseudo-density map therefore cannot be constructed from the annotated examples alone. Given that only positive point annotations and unlabeled regions are available, positive-unlabeled (PU) learning [19] is a natural fit for this problem.
Key contributions and applications. We propose a learn-from-educator (LFE) network based on PU learning to tackle the pixel-wise binary classification problem of small object localization. Instead of producing a continuous density map, we first develop a minimal “redundant count map” that can be generated from incomplete point annotations (see Supplementary Note S1 and Table S1). This count map greatly alleviates the imbalance between positive targets and negative background, allowing the PU learning strategy to work effectively. More importantly, training supervised by the minimal “redundant count map” tends to produce prominent local maxima at cell centers. Unlike traditional PU learning, which attempts to identify negative samples within the unlabeled data, we introduce an unsupervised label noise modeling approach [20] that discerningly divides the unlabeled region into positive instances and negative background. Supervised by both the provided and the predicted labels, an ordinary positive–negative learning process is then adopted to further improve localization performance. The LFE algorithm is evaluated exhaustively on five datasets covering cells, animal/insect populations, and human crowds. We envision that this work also provides an efficient solution for cell-encapsulated microfluidic droplets [6,7], opening a new window for microfluidic applications with minimal annotation cost. The main contributions of this study involve the following:
We develop a minimal “redundant count map” that can handle incomplete annotations and tends to produce prominent local maxima at the center of each target in the corresponding predicted location map.
We introduce an unsupervised loss modeling method to the PU learning strategy, which successfully learns to localize small objects from a few point annotation examples.
Quantitative experimental results show that our approach can achieve encouraging localization performance even under the supervision of one-tenth of the overall annotations.

2. Related Work

Small object localization. Instead of time-consuming bounding-box or mask annotations for object detection or segmentation, most object localization studies adopt a simplified strategy of point annotations. These point annotations, centered on each object, are converted into density maps by placing a Gaussian kernel at each annotated location. Early studies employ traditional regression models to predict density maps whose local maxima suggest instance locations, e.g., for animal populations [10,22], and CNN-based models such as SC-CNN [1] and SR-CNN [21] apply this idea to cell nuclei. Zhang et al. design a multi-column CNN to regress the density map [23], and many later studies [24,25,26] develop more sophisticated architectures to reduce counting errors. Idrees et al. [15] exploit multiple Gaussian kernels to construct a composition loss that sharpens blurred maps and yields discrete target positions. Liu et al. [11] adopt a normalized variant of cross-entropy loss to improve localization accuracy. Ma et al. [13] propose a Bayesian loss and achieve remarkable localization results. Sam et al. [18] introduce a detection framework that simultaneously predicts the localization and size of each object. Zhou et al. propose SFCN-OPI, which employs a dual-branch structure with target prior interaction to enhance both detection accuracy and fine-grained classification [16]. In addition, many biomedical studies [27,28,29,30] are devoted to automatically localizing single cells in histopathological images, since manual annotation by seasoned pathologists is both labor-intensive and costly. He et al. [3] employ multiscale density maps as ground truth to provide reverse guidance to feature extraction. To further reduce annotation cost, weakly supervised (e.g., WSL-LIR [17]) and unsupervised (e.g., SSAE [31]) cell detection methods [2] have also been proposed.
Positive-unlabeled (PU) learning. Widely applied to retrieval and outlier detection, PU learning trains a binary classifier from only positive and unlabeled data [32,33]. Current PU learning approaches fall broadly into two branches according to how the unlabeled data are treated. One heuristically discriminates high-confidence negative samples from the unlabeled data and then conducts normal positive–negative learning with the provided positive samples [34]. The other directly regards the unlabeled data as negative samples, which suffers from label noise [35]. To handle large unlabeled datasets, Sansone et al. [36] propose a scalable PU learning algorithm. Kiryo et al. [19] develop a non-negative risk estimator to avoid overfitting when applying PU learning to sophisticated models such as neural networks. Instead of recognizing negatives in the unlabeled data, Wang et al. [37] leverage an adaptively augmented algorithm to select positive samples during training. Inspired by this line of work, we introduce an unsupervised label noise modeling method [20] that characterizes the unlabeled negative loss as a two-component beta mixture model (BMM) to provide pseudo-labels for the unlabeled data. As summarized in Supplementary Table S2, our LFE framework differs fundamentally from existing approaches: it relies neither on full supervision nor on density map regression, but learns effectively from very sparse point annotations via a novel PU learning mechanism combined with a “redundant count map” representation.

3. Methods and Materials

3.1. Minimal “Redundant Count Map”

Our approach consists of two main stages, PU learning and pseudo-labeling, as illustrated in Figure 2. Before describing them, note that the pseudo-density map used by traditional counting-based methods cannot be accurately constructed in this study, because only some of the targets in each image carry point annotations. We therefore developed a minimal “redundant count map” that is discretely distributed and consequently able to accommodate incomplete annotations. More importantly, the count map is designed to exhibit a prominent local maximum at each target center.
The method of “redundant count map” was previously proposed in Count-ception [38], wherein a 32 × 32 square kernel was utilized to compute the cell population redundantly. Count-ception is a fully supervised learning approach that regresses a count map whose sum yields the object count. Despite promising counting performance, it fails to localize or predict the x, y coordinates of each object. This is primarily because it directly predicts the count map, which obscures the precise location of each target. In contrast, our approach predicts a location map, as shown in Supplementary Figure S1.
Assuming that the true location map of all small objects is denoted by $P_{gt}$, each element of the redundant count map $C_{gt}$ can be computed by summing the pixels within a surrounding square kernel centered at the corresponding location:

$$C_{gt}[i,j] = \sum_{[m,n] \in S_r[i,j]} P_{gt}[m,n] \quad (1)$$

where $S_r[i,j]$ denotes an $r \times r$ square kernel centered at the point $[i,j]$. Similarly, applying Equation (1) to the predicted location map $P_{pre}$ yields a predicted count map $C_{pre}$.
The “redundant count map”, generated by an r × r square kernel, offers a discrete alternative to continuous density maps. This eases segmentation into target and unlabeled regions, and boosts the positive pixel rate by a factor of r × r, effectively alleviating class imbalance for PU learning.
Although a large kernel can significantly increase the positive rate, it may encompass more than one instance in a location map, thereby leading to localization ambiguity, as detailed in Supplementary Note S2. As a compromise, the kernel can be slightly smaller than the minimal size of instances, which not only alleviates the class imbalance but also reduces the localization ambiguity.
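To make the construction concrete, the sketch below builds a location map and its redundant count map from a partial set of point annotations via Equation (1). This is a minimal NumPy/SciPy illustration under our own naming conventions; the paper's released code may differ.

```python
import numpy as np
from scipy.signal import convolve2d

def redundant_count_map(points, shape, r):
    """Build a location map P and its redundant count map C (Equation (1)).

    points: iterable of (row, col) annotated centers (may be only a subset)
    shape:  (H, W) of the location map
    r:      odd kernel size, chosen slightly smaller than the minimal target
    """
    P = np.zeros(shape, dtype=np.float32)
    for i, j in points:
        P[i, j] = 1.0
    # Each count entry sums P over an r x r square kernel centered on it,
    # raising the positive-entry rate by roughly a factor of r * r.
    C = convolve2d(P, np.ones((r, r), dtype=np.float32), mode="same")
    return P, C

# The 5 x 5 toy example of Figure 1B: one center annotation, 3 x 3 kernel.
P, C = redundant_count_map([(2, 2)], (5, 5), 3)
```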
It is noteworthy that the redundant count map still preserves the location of each target, which is conducive to generating a prominent local maximum at the target center in the predicted location map, as shown in Supplementary Figure S2. Without loss of generality, assume a 5 × 5 image containing a single target at its center and a 3 × 3 kernel; the location map $P_{gt}$ and its corresponding count map $C_{gt}$ are then as depicted in Figure 1B. The L2 loss between the predicted count map, derived from $P_{pre}$, and the ground truth $C_{gt}$ can be expressed as follows:

$$\min_{P_{pre}} \sum_{i,j} \Big( \sum_{[m,n] \in S_3[i,j]} P_{pre}[m,n] - C_{gt}[i,j] \Big)^{2} \quad \text{s.t.} \quad P_{pre}[m,n] \geq 0 \quad (2)$$
Since ReLU [39] is utilized as the activation function of the network employed in this research, the elements of the predicted location map remain non-negative. Accordingly, for the zero-valued entries $C_{gt}[i,j] = 0$, minimizing Equation (2) yields

$$P_{pre}[m,n] = 0, \quad \forall [m,n] \in S_3[i,j] \quad (3)$$
Subsequently, the sparse and precise solution is obtained by setting $P_{pre}[3,3] = 1$, which is exactly the location map $P_{gt}$. In other words, the true location map is the optimal solution to Equation (2). Consequently, a response distribution peaking at the location of each true target emerges in the predicted location map. Instead of a cross-entropy classification loss, the L2 loss is used to preserve target quantification information, ensuring that the integral of the predicted location map reflects the total object count. Since prominent local maxima tend to arise at instance centers in the location map, heuristically identifying local maxima in $P_{pre}$ provides the precise location of each target.
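As an illustration of this inference step, the following sketch extracts local maxima from a predicted location map as target coordinates; the threshold of 0.5 is an assumption for illustration, not a value reported in this paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def localize(P_pre, r, threshold=0.5):
    """Return (row, col) coordinates of prominent local maxima in the
    predicted location map P_pre; the threshold suppresses weak
    background responses before peak picking."""
    is_peak = (P_pre == maximum_filter(P_pre, size=r)) & (P_pre > threshold)
    return np.argwhere(is_peak)
```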

3.2. Positive-Unlabeled (PU) Learning

PU learning is generally used to address binary classification tasks in which only positive and unlabeled samples are provided. Let $f$ be an arbitrary classifier and $\pi_p$ the class-prior probability of positive samples. The empirical risk associated with PU learning can be expressed as follows:

$$\tilde{R}_{pu}(f) = \pi_p \hat{R}_p^{+}(f) - \pi_p \hat{R}_p^{-}(f) + \hat{R}_u^{-}(f) \quad (4)$$
where $\hat{R}_p^{+}(f)$ denotes the mean empirical risk of positive samples, and $\hat{R}_p^{-}(f)$ and $\hat{R}_u^{-}(f)$ represent the mean empirical risk of assigning positive or unlabeled samples to the negative class, respectively. To address the overfitting issue in PU learning, Kiryo et al. [19] proposed a non-negative risk estimator, which re-expresses Equation (4) as:

$$\hat{R}_{pu}(f) = \pi_p \hat{R}_p^{+}(f) + \max\left\{0,\; \hat{R}_u^{-}(f) - \pi_p \hat{R}_p^{-}(f)\right\} \quad (5)$$
The loss function in this study follows this non-negative risk estimator for the PU learning stage, as illustrated in Figure 2. Here, $\tilde{P}_{gt}$ and $\tilde{C}_{gt}$ denote the partially annotated location map and its corresponding redundant count map, respectively. To implement the PU learning framework, the elements of $\tilde{C}_{gt}$ are divided into two categories. The non-zero entries $\Phi_p = \{[i,j] \mid \tilde{C}_{gt}[i,j] \neq 0\}$ can be considered positive samples. The zero entries $\Phi_u = \{[i,j] \mid \tilde{C}_{gt}[i,j] = 0\}$, which cover both positive targets and negative background, can be considered unlabeled data. If these zero entries $\Phi_u$ were directly used as the unlabeled data, the positive rate within them would be inconsistent with that of the entire training dataset, as it would vary with the number of non-zero entries $\Phi_p$ in $\tilde{C}_{gt}$. Fortunately, the empirical risks in Equations (4) and (5) do not require the positive and unlabeled data to be independent. Therefore, a straightforward and reliable approach is to use the entire training dataset as the unlabeled data.
As previously discussed, the empirical risk is calculated using the L2 loss between the predicted count map $C_{pre}$ and the ground truth $\tilde{C}_{gt}$. Each term of the empirical risk in Equation (5) is then given by:

$$\hat{R}_p^{+}(f) = \frac{1}{N_p} \left\| \left( C_{pre} - \tilde{C}_{gt} \right) \odot M_{gt} \right\|_F^2 \quad (6)$$

$$\hat{R}_u^{-}(f) - \pi_p \hat{R}_p^{-}(f) = \frac{1}{N} \left\| C_{pre} \right\|_F^2 - \frac{\pi_p}{N_p} \left\| C_{pre} \odot M_{gt} \right\|_F^2 \quad (7)$$
where $C_{pre}[i,j] = \sum_{[m,n] \in S_r[i,j]} P_{pre}[m,n]$, and $M_{gt}$ is a binary indicator mask marking the non-zero entries of $\tilde{C}_{gt}$. The overall sum of $M_{gt}$, denoted $N_p$, therefore gives the number of positive samples. The Hadamard product is denoted by $\odot$, and $N$ is the total number of entries in $C_{pre}$. In our experiments, we found that the hyperbolic tangent function is superior at suppressing background noise, so the negative risk in Equation (7) is replaced by the following:
$$\hat{R}_u^{-}(f) - \pi_p \hat{R}_p^{-}(f) = \frac{1}{N} \sum_{i,j} \tanh\left( C_{pre}[i,j] \right) - \frac{\pi_p}{N_p} \sum_{i,j} \tanh\left( C_{pre}[i,j] \right) M_{gt}[i,j] \quad (8)$$
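For clarity, a hedged PyTorch sketch of the resulting loss, combining the non-negative estimator of Equation (5), the positive risk of Equation (6), and the tanh negative risk of Equation (8), is given below; the tensor interface is our assumption rather than the authors' released implementation.

```python
import torch

def nonneg_pu_loss(C_pre, C_gt_partial, M_gt, pi_p=0.1):
    """Non-negative PU risk: pi_p * R_p^+ + max(0, R_u^- - pi_p * R_p^-).

    C_pre:        predicted redundant count map (from the location map)
    C_gt_partial: partially annotated ground-truth count map (C~_gt)
    M_gt:         binary mask of the non-zero entries in C~_gt
    """
    N_p = M_gt.sum().clamp(min=1.0)   # number of positive entries
    N = C_pre.numel()                 # total number of entries
    # Positive risk, Equation (6): L2 distance on annotated entries only.
    r_pos = ((C_pre - C_gt_partial) * M_gt).pow(2).sum() / N_p
    # Negative risk, Equation (8): tanh suppresses background noise.
    r_neg = torch.tanh(C_pre).sum() / N \
        - pi_p * (torch.tanh(C_pre) * M_gt).sum() / N_p
    # Non-negative correction of Kiryo et al. [19].
    return pi_p * r_pos + torch.clamp(r_neg, min=0.0)
```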

3.3. Pseudo-Labeling

To further improve localization performance, pseudo-labeling is introduced after the PU learning procedure. This technique uses pseudo-labels to identify, with high confidence, authentic targets within the unlabeled region $\Phi_u$ of the training images. Pseudo-labels are then assigned to the unlabeled elements of $\Phi_u$, and an ordinary positive–negative learning procedure can be performed. Care must be taken to assign pseudo-labels judiciously, since false positives could adversely affect the training process.
From another perspective, the unlabeled entries $\Phi_u$ can also be viewed as negative samples mixed with some noisy (positive) labels. The goal is to identify these noisy labels and correct them with positive annotations. We therefore compute the element-wise distance between the predicted count map $C_{pre}$ and the ground truth $\tilde{C}_{gt}$ as the negative loss of the unlabeled entries:

$$\ell[i,j] = C_{pre}[i,j]^2, \quad \text{for } [i,j] \text{ with } \tilde{C}_{gt}[i,j] = 0 \quad (9)$$
Intuitively, a larger loss suggests a higher probability of a noisy label and vice versa. To measure this probability adaptively, we exploit an unsupervised approach that models the loss distribution of clean and noisy samples by fitting a beta mixture model (BMM) with two components. The probability density function of a BMM over the loss $\ell$ is given by:

$$p(\ell) = \sum_{k=1}^{K} \lambda_k \, p(\ell \mid \alpha_k, \beta_k) \quad (10)$$
where $p(\ell \mid \alpha_k, \beta_k)$ denotes a beta distribution with parameters $\alpha_k, \beta_k > 0$, and $\lambda_k$ is the mixing weight of the $k$th component. Because the clear negative labels ($\ell[i,j] < 0.01$) would dominate the BMM distribution, they are removed prior to fitting. Running an Expectation-Maximization (EM) procedure on the loss observations $\ell$ yields the BMM parameters $\alpha_k$, $\beta_k$, and $\lambda_k$. Finally, the probability that an unlabeled entry (corresponding to a local maximum in the predicted location map) is noisy can be computed from the posterior probability:
$$p\left(k = 1 \mid \ell[i,j]\right), \quad [i,j] \in \Psi_{max} \quad (11)$$

where $\Psi_{max}$ is the set of coordinates of all local maxima found in the predicted location map.
With the confidence of each local maximum being a positive target thus quantified, a simple confidence threshold can be used to classify $\Phi_u$ (the unlabeled entries of $\tilde{C}_{gt}$) into three categories: clear negatives $\Phi_{clr}^{-}$, predicted positives $\Phi_{pre}^{+}$, and predicted negatives $\Phi_{pre}^{-}$. Combined with the provided positives $\Phi_p$, a positive–negative learning process can then be implemented to further enhance the localization accuracy of small objects, as shown in Figure 3. Given that the kernel size is smaller than the minimum target size, it is plausible to assume that each predicted positive point in the redundant count map covers a single small object. As in the PU learning stage, an L2-tanh loss is adopted instead of cross entropy to preserve count information, which can be expressed as follows:
$$L = \frac{1}{N} \left( \sum_{[m,n] \in \Phi^{+}} \left( C_{pre}[m,n] - 1 \right)^2 + \sum_{[i,j] \in \Phi^{-}} \tanh\left( C_{pre}[i,j] \right) \right) \quad (12)$$

where $C_{pre}[i,j] = \sum_{[m,n] \in S_r[i,j]} P_{pre}[m,n]$. Here $\Phi^{-}$ comprises the clear and predicted negatives $\Phi_{clr}^{-}$ and $\Phi_{pre}^{-}$, and $\Phi^{+}$ comprises the predicted and provided positives $\Phi_{pre}^{+}$ and $\Phi_p$. More details about the pseudo-labeling procedure are given in Supplementary Note S3.
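The pseudo-labeling stage hinges on the BMM fit; a compact sketch using EM with weighted moment matching is shown below. Normalizing the losses onto (0, 1) and the specific initialization are our simplifications of the procedure described by Arazo et al. [20].

```python
import numpy as np
from scipy.stats import beta

def fit_bmm(losses, n_iter=20, eps=1e-4):
    """Fit a two-component beta mixture (Equation (10)) to the negative
    losses of unlabeled entries via EM with weighted moment matching."""
    l = losses[losses >= 0.01]              # drop clear negatives first
    l = np.clip(l / l.max(), eps, 1 - eps)  # map onto the beta support (0, 1)
    params = [(2.0, 5.0), (5.0, 2.0)]       # (alpha_k, beta_k) per component
    lam = np.array([0.5, 0.5])              # mixing weights lambda_k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component, cf. Eq. (11).
        pdf = np.stack([lam[k] * beta.pdf(l, *params[k]) for k in (0, 1)], axis=1)
        w = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: weighted moment matching for each beta component.
        for k in (0, 1):
            m = np.average(l, weights=w[:, k])
            v = np.average((l - m) ** 2, weights=w[:, k]) + eps
            s = m * (1 - m) / v - 1
            params[k] = (max(m * s, eps), max((1 - m) * s, eps))
        lam = w.mean(axis=0)
    return params, lam
```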

3.4. Architectural Design and Training Details

For the architecture, we elected to use our previous model, LIRNet [17], to predict the two-dimensional location map of small objects. We believe that any encoder-decoder network capable of generating a predicted location map with the same pixel resolution as the input image would be equally effective. As illustrated in Figure 2, training consists of two sequential stages, PU learning and pseudo-labeling. Both stages are optimized with the Adam optimizer [40] for 100 epochs. The learning rates are initialized to 1 × 10−4 for PU learning and 1 × 10−6 for pseudo-label learning when the batch size is greater than 8, and are subsequently adjusted based on performance on a separate validation dataset.
During PU learning, we observed that significant improvements in localization performance could be achieved by feeding both the positive images (each containing a few point-annotation examples) and their augmented versions (randomly rotated/flipped) into the same training iteration. Consequently, the network receives batches of 4n images (n = 1, 2, …), consisting of n positive images, n augmented images generated by randomly transforming those positive images, and 2n unlabeled images drawn from a separate data loader.
In each training epoch, we randomly select 2/3 of the given point annotations in each training image to generate the ground-truth location map, which proves highly effective in stabilizing training and preventing overfitting. Because searching for local maxima in a predicted location map is computationally expensive, and the early period of PU training tends to yield unreasonable maps, we delay the quantification of localization performance by 20 epochs. The model with the best performance on a validation dataset during PU learning is saved and used to initialize the pseudo-labeling stage.
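The two training heuristics of this subsection, per-epoch subsampling of the provided annotations and the 4n batch composition, can be sketched as follows; the loader and augmentation interfaces are placeholders we assume for illustration, not the exact training pipeline.

```python
import random

def subsample_annotations(points):
    """Each epoch, keep a random 2/3 of the provided point annotations
    per image before building the ground-truth location map."""
    k = max(1, round(len(points) * 2 / 3))
    return random.sample(points, k)

def build_batch(pos_images, unlabeled_iter, n, augment):
    """Assemble a 4n-image batch: n positive images, n randomly
    rotated/flipped copies of them, and 2n unlabeled images."""
    pos = random.sample(pos_images, n)
    aug = [augment(im) for im in pos]
    unl = [next(unlabeled_iter) for _ in range(2 * n)]
    return pos + aug + unl
```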

3.5. Parameter Configuration

Two key parameters, the kernel size r in Equation (1) and the class-prior probability of positive samples $\pi_p$ in Equation (8), must be accurately adjusted when processing different datasets. These two parameters are not independent: fixing one allows the other to be estimated. As mentioned earlier, all non-zero entries in the ground truth $\tilde{C}_{gt}$ are considered positive samples, and each entry of $\tilde{C}_{gt}$ is the sum of the pixels within an $r \times r$ square kernel of the corresponding location map.
Consequently, the class-prior probability of positive samples $\pi_p$ can be estimated as the average proportion of non-zero entries in a complete count map $C_{gt}$ (not $\tilde{C}_{gt}$, which contains only partial positive annotations). It can be expressed as $\pi_p \approx \mu r^2 / (\omega h)$, where $\mu$ is the average number of targets per image, and $h$ and $\omega$ are the height and width of the count map. Assuming that $\mu$ can be provided or approximately estimated, $\pi_p$ follows once the kernel size r is fixed. To balance class imbalance against localization ambiguity, r should be slightly smaller than the minimum target size in each dataset. To prevent over-tuning of parameters, $\pi_p$ is set to 0.1 for all subsequent experiments.
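The prior estimate can be written directly; the numerical example below assumes, purely for illustration, a 1024 × 768 count map for ST Part B with roughly 123 heads per image.

```python
def estimate_prior(mu, r, width, height):
    """Estimate the positive class prior: pi_p ~= mu * r^2 / (width * height)."""
    return mu * r * r / (width * height)

# Illustrative only: ~123 targets, r = 17, 1024 x 768 count map -> ~0.045,
# the same order of magnitude as the fixed value pi_p = 0.1 used here.
print(estimate_prior(123, 17, 1024, 768))
```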

3.6. Dataset and Evaluation Metric

To evaluate the effectiveness of the proposed approach, we tested it on five representative small-object datasets: two hematoxylin and eosin (H&E) stained histopathology image datasets [1,38], two animal population video sequences [5], and one human crowd dataset [13], ShanghaiTech Part B (abbreviated ST Part B). For the two animal datasets, captured from video sequences, we used the first 32 images for training and the subsequent 8 images for validation, consistent with the previous study [10].
The two recently collected human H&E datasets, the Colorectal Adenocarcinoma (CA) dataset and the Modified Bone Marrow (MBM) dataset, were randomly divided into training, validation, and test sets. Since the ST Part B dataset comes with fixed training and test samples, we randomly selected a small portion of the training samples as a validation dataset. The detailed split of each dataset is shown in Supplementary Note S4 and Table S3. We evaluate the proposed approach with a varying number of point-annotation examples, chosen according to the average number of instances in each dataset, as shown in Table 1. For each number of labeled examples, we generated 10 repeated sample selections, each with a different random seed, to train the localization model, and report the mean and standard deviation across the 10 repeats.
Several metrics, including precision, recall, and F1 score, are employed to evaluate localization performance. For the two histopathology image datasets, a location prediction is considered a true positive (TP) if its distance to the nearest target is less than 6 pixels, consistent with previous studies [1,17]. For the two small-animal datasets, the distance threshold for a TP is set to the minimal object radius (12 and 5 pixels for Honeybee and Fish, respectively), consistent with the previous study [10]. For the ST Part B dataset, we adopt the Mean Localization Error (MLE) [41] to measure the localization error between predicted locations and their matched ground truth.
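For completeness, a sketch of the distance-thresholded evaluation is given below; the greedy nearest-neighbor matching is a common convention that we assume here, as the matching rule is not spelled out in the text.

```python
import numpy as np

def localization_scores(pred, gt, dist_thr):
    """Precision/recall/F1 with one-to-one greedy matching: a prediction
    within dist_thr pixels of an unmatched ground-truth point is a TP."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    matched = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in pred:
        if len(gt) == 0:
            break
        d = np.linalg.norm(gt - p, axis=1)
        d[matched] = np.inf          # each ground-truth point matches once
        j = int(np.argmin(d))
        if d[j] <= dist_thr:
            matched[j] = True
            tp += 1
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gt), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```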

4. Results and Discussion

4.1. Prominent Local Maxima for Object Localization

To verify the hypothesis that the proposed minimal “redundant count map” tends to generate a prominent local maximum at each target center, we extract the pixel value of each predicted target and its proportion within a square kernel of the predicted location map. Experimental results for each dataset are summarized in Table 2. The kernel size r is chosen to cover 99.7% of an instance under a normalized Gaussian distribution, i.e., $r/2 = 3\sigma$. The Gaussian center in Table 2 therefore represents the central response of a normalized Gaussian distribution with standard deviation $\sigma = r/6$. Although the mean local maximum and its mean proportion over the kernel decrease as the target/kernel size increases, they remain higher than the Gaussian center by a large margin (greater than three times), especially for the Honeybee, Fish, MBM, and ST Part B datasets.
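As a worked check of the baseline column, the Gaussian center follows from the peak of a normalized 2D Gaussian, under our reading that the peak value $1/(2\pi\sigma^2)$ is what Table 2 reports; for the CA dataset ($r = 7$):

```latex
\sigma = \frac{r}{6} = \frac{7}{6} \approx 1.17, \qquad
G(0,0) = \frac{1}{2\pi\sigma^{2}} = \frac{1}{2\pi\,(7/6)^{2}} \approx 0.12
```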
The true and predicted location maps are presented in Figure 4, which illustrates typical examples in instance-dense regions. Each target remains visually separable even in dense crowds, indicating the effectiveness of the proposed map in handling incomplete annotations. A key advantage of this method is its tendency to yield prominent local maxima at target centers, which serve as reliable markers for localizing small objects (see Supplementary Figure S2). Additionally, the uniqueness of the proposed solution to the minimization problem is validated separately through a theoretical derivation in Supplementary Note S5; it ensures that the model converges to a single, plausible estimate of the underlying object locations. This property, enhanced by our well-conditioned transformation, makes the learning process inherently robust to noise and ambiguities in the training annotations. In conclusion, the proposed minimal “redundant count map” is shown to localize instances from incomplete annotations while tending to produce prominent local maxima at object centers.

4.2. Performance Comparison Between Baseline and the Proposed Method

We further quantify the localization performance of the proposed approach against baseline methods. Experimental results on each test dataset are shown in Table 3, Table 4, Table 5, Table 6 and Table 7 for a varying number of point annotations, where LFE denotes our proposed approach and the digit suffix denotes the number of point annotations used per training image. For example, LFE-3 indicates that only three annotations were provided in each image when training LFE. The percentage in the last column gives the proportion of annotated examples. Bold black font indicates the best performance, and blue font highlights the performance of the proposed LFE supervised by approximately 10% of all annotation points. Representative localization results obtained with less than 10% of point annotations are shown in Figure 5.
CA cells dataset. Table 3 compares the localization performance of the proposed approach against other sophisticated methods, including SSAE [31], SR-CNN [21], SC-CNN [1], SFCN-OPI [16], SSL-LIR, and WSL-LIR [17], on the CA cell test dataset. Our LFE with 100 and 50 labeled examples achieves performance comparable to the fully supervised SC-CNN and SR-CNN, despite being supervised by only 33.7% and 16.8% of the annotated points, respectively. Furthermore, compared with our previous weakly supervised WSL-LIR [17], LFE-50 reduces annotation costs by more than 50% while maintaining an almost identical F1 score. These results confirm that the PU learning strategy with unsupervised loss modeling can learn to localize small instances from a few point-annotation examples.
Table 3. The localization performance on the CA cells dataset.

| Method | Precision | Recall | F1 Score ↑ | Labels |
|---|---|---|---|---|
| SSAE | 0.617 | 0.644 | 0.630 | 0% |
| SR-CNN | 0.783 | 0.804 | 0.793 | 100% |
| SC-CNN | 0.781 | 0.823 | 0.802 | 100% |
| SFCN-OPI | 0.819 | 0.874 | 0.834 | 100% |
| SSL-LIR | 0.854 | 0.850 | 0.852 | 100% |
| WSL-LIR | 0.810 | 0.777 | 0.793 | 43.8% |
| LFE-100 | 0.78 ± 0.02 | 0.83 ± 0.02 | 0.80 ± 0.01 | 33.7% |
| LFE-50 | 0.76 ± 0.02 | 0.82 ± 0.01 | 0.79 ± 0.01 | 16.8% |
| LFE-25 | 0.75 ± 0.02 | 0.81 ± 0.02 | 0.78 ± 0.01 | 8.4% |
| LFE-10 | 0.74 ± 0.02 | 0.75 ± 0.04 | 0.74 ± 0.02 | 3.4% |

Note: Bold font indicates the best performance, and blue underlined text highlights the performance of our approach supervised by approximately 10% of all point annotations. “↑” indicates “larger is better.”
Interestingly, when the label reduction reaches 91.6%, the F1 score of LFE-25 is only 0.023 lower than that of LFE-100, indicating that our approach is not overly sensitive to label reduction. Even with a further reduction in labeled examples, LFE-10 (using only 3.4% of the labels) still outperforms the unsupervised SSAE approach by a considerable margin. A distinctive advantage of LFE over both unsupervised and fully supervised methods is its ability to cope with data scarcity: when only a minute portion (<10%) of the data is labeled, LFE makes the most of the available information by focusing on the most informative examples, which can lead to more precise predictions and better generalization.
MBM cells dataset. Table 4 quantifies the performance of our approach at different levels of label reduction on histopathological bone marrow images. Using only 39.7% of annotated points, LFE-50 achieves localization performance comparable to the fully supervised SSL-LIR approach [17]. Moreover, with less than 8% of labeled examples for training, LFE-10 still correctly localizes more than 83% of the targets in the test dataset, indicating its effectiveness. The LFE framework offers several potential benefits over fully supervised approaches (detailed in Supplementary Note S6), a key one being more efficient use of labeled data.
Table 4. The localization performance on the MBM cells dataset.

| Method | Precision | Recall | F1 Score ↑ | Labels |
|---|---|---|---|---|
| SSL-LIR | 0.885 | 0.873 | 0.879 | 100% |
| LFE-50 | 0.83 ± 0.02 | 0.90 ± 0.02 | 0.87 ± 0.01 | 39.7% |
| LFE-25 | 0.81 ± 0.02 | 0.88 ± 0.01 | 0.84 ± 0.01 | 19.8% |
| LFE-10 | 0.76 ± 0.01 | 0.83 ± 0.04 | 0.79 ± 0.01 | 7.9% |
| LFE-5 | 0.75 ± 0.03 | 0.76 ± 0.03 | 0.75 ± 0.02 | 4.0% |

Note: Bold font indicates the best performance, and blue underlined text highlights the performance of our approach supervised by approximately 10% of all point annotations. “↑” indicates “larger is better.”
Honeybee and Fish datasets. Compared with the SIDIP method [10], the LFE-15 model achieves the same localization performance on the Honeybee test dataset, as shown in Table 5, while requiring only 53.6% of the labeled points. Because the number of honeybees per image is relatively small, a 90% label reduction means that only about three targets per image were annotated. Even so, our LFE-3 model still detects about 83% of the small honeybees in the test dataset, with 72% of its location predictions being true positives, demonstrating the feasibility of small object localization with 90% annotation reduction.
Table 5. The localization performance on the Honeybee dataset.

| Method | Precision | Recall | F1 Score ↑ | Labels |
|---|---|---|---|---|
| SIDIP | 0.921 | 0.787 | 0.849 | 100% |
| LFE-15 | 0.79 ± 0.04 | 0.91 ± 0.03 | 0.85 ± 0.03 | 53.6% |
| LFE-10 | 0.78 ± 0.04 | 0.91 ± 0.03 | 0.84 ± 0.03 | 35.7% |
| LFE-5 | 0.68 ± 0.06 | 0.86 ± 0.04 | 0.76 ± 0.03 | 17.9% |
| LFE-3 | 0.72 ± 0.11 | 0.83 ± 0.04 | 0.76 ± 0.05 | 10.7% |

Note: Bold font indicates the best performance, and blue underlined text highlights the performance of our approach supervised by approximately 10% of all point annotations. “↑” indicates “larger is better.”
Comparative localization performance on the Fish dataset is presented in Table 6. Even though less than 20% of labeled targets were provided for training, our LFE-10 approach still outperforms the fully supervised SIDIP method, and the performance of LFE-5 with 91.1% label reduction is very close to that of SIDIP using 100% of the annotations. To further demonstrate the effectiveness of LFE-10, we conducted a series of experiments on both datasets; the results in Figure 5A,B show that nearly all of the small objects are captured and localized. LFE-10 achieves an average localization error of 2 pixels and higher precision and recall (96% and 94%) than the fully supervised SIDIP method.
Table 6. The localization performance on the Fish dataset.

| Method | Precision | Recall | F1 Score ↑ | Labels |
|---|---|---|---|---|
| SIDIP | 0.951 | 0.921 | 0.936 | 100% |
| LFE-25 | 0.97 ± 0.01 | 0.95 ± 0.01 | 0.96 ± 0.01 | 44.6% |
| LFE-15 | 0.95 ± 0.02 | 0.95 ± 0.01 | 0.95 ± 0.01 | 26.8% |
| LFE-10 | 0.96 ± 0.01 | 0.94 ± 0.02 | 0.95 ± 0.01 | 17.9% |
| LFE-5 | 0.94 ± 0.03 | 0.90 ± 0.05 | 0.92 ± 0.04 | 8.9% |

Note: Bold font indicates the best performance, and blue underlined text highlights the performance of our approach supervised by approximately 10% of all point annotations. “↑” indicates “larger is better.”
ST Part B. A detailed study on the ST Part B human crowd dataset shows that the proposed LFE method outperforms other state-of-the-art methods on the MLE metric. The comparison is given in Table 7, where the LSC-CNN and CSR-A-thr results are taken directly from the literature [41]. LFE supervised by 81.3% of the point annotations outperforms the fully supervised LSC-CNN. Encouragingly, even when only about 10% of the point annotations (12 examples per image) are randomly selected to supervise training, the average localization error is still lower than that of CSR-A-thr by a large margin. Representative localization results in Supplementary Figure S3 suggest that our method does not deteriorate significantly as the number of annotation examples decreases. Its ability to handle complex data structures and large datasets makes it a strong choice for small-object localization tasks.
Table 7. The localization performance on the ST Part B dataset.

| Method | MLE ↓ | Labels |
|---|---|---|
| CSR-A-thr | 12.28 | 100% |
| LSC-CNN | 9.0 | 100% |
| LFE-100 | 8.70 ± 0.14 | 81.3% |
| LFE-50 | 9.63 ± 0.21 | 40.6% |
| LFE-25 | 10.10 ± 0.24 | 20.3% |
| LFE-12 | 10.79 ± 0.43 | 9.8% |

Note: Bold font indicates the best performance, and blue underlined text highlights the performance of our approach supervised by approximately 10% of all point annotations. “↓” indicates “smaller is better.”

4.3. Ablation Study on Pseudo-Labeling

To demonstrate the effectiveness of the pseudo-labeling strategy in the LFE method, we recorded and analyzed localization performance before and after the pseudo-labeling process. The experimental results are depicted in Figure 6. Pseudo-labeling boosts the overall F1 score by 0.5% to 5% on the first four datasets. Specifically, on the Honeybee and Fish datasets, the average performance after pseudo-labeling improved by 3.32% and 2.41%, respectively, and significant F1 improvements were also observed across different numbers of point annotations on the CA and MBM cell datasets.
Similarly, on the ST Part B dataset, pseudo-labeling reduced the mean localization error (MLE) by 5.9%. Admittedly, the performance of the LFE method can deteriorate as labeled examples are reduced, owing to the decline in supervised information: learning algorithms rely on an abundance of data to generalize, and without sufficient examples they may not fully capture the variation and complexity in the data. Nevertheless, Figure 6 shows that the LFE method performs remarkably well on the Fish dataset even with a label reduction of more than 90%, because the small objects in this dataset share similar morphological features and are less susceptible to perspective changes. Figure 6 also shows that the standard deviation of the F1 score (black error bars) increases as labels are reduced, since fewer labeled examples are less likely to cover all visual features of the targets. In summary, our method shows consistent performance across all datasets, with notable localization improvement after pseudo-labeling, and it is robust to label reduction, effectively handling datasets with more than 90% of examples unlabeled.

5. Discussion and Conclusions

Small object localization has found widespread applicability in diverse fields, including single-cell detection, droplet microfluidics, animal population monitoring, and crowd counting. Inspired by the learning behavior exhibited by human annotators, we developed a PU learning algorithm, the learn-from-educator (LFE) model, specifically designed to localize small objects by exploiting a limited number of point-based annotations. Instead of regressing a continuous density map of small objects, LFE employs a discrete minimal “redundant count map”, which tends to produce a prominent local maximum at the spatial center of the target in the predicted location map. To further enhance localization performance, we have incorporated an unsupervised loss modeling approach into the PU learning strategy, which provides pseudo-labels for a subsequent normal positive–negative learning process. If only a small proportion of the data is labeled, LFE may be a more efficient approach (Supplementary Table S4) as it can handle incomplete annotations and make more productive use of available information. Quantitative experimental results suggest that our algorithm significantly reduces annotation costs and achieves encouraging localization performance. The LFE system holds significant potential for application in droplet microfluidics, enabling precise, high-throughput droplet counting and single-cell localization while drastically reducing annotation costs. Looking forward, the straightforward adaptation of our framework to dynamic environments like live-cell microscopy or microfluidics presents a promising direction. By fine-tuning on a minimal annotated subset from a new domain, our data-efficient approach can be readily deployed for temporal analysis.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/mi16121379/s1, Supplementary Note S1. Brief description of localization ambiguity. Note S2. Improve localization performance by pseudo-labeling. Note S3. Detailed illustration for the dataset split. Note S4. Difference between density-map approaches and our method. Note S5. Theoretical derivation for the uniqueness of our method. Note S6. The benefits of the proposed method over fully supervised approaches. Figure S1. Comparison of different annotation and map generation methods. Figure S2. Illustration of the localization ambiguity. Figure S3. Illustration of the optimal solution and the uniqueness of the proposed method. Figure S4. Typical visualization of the localization results on a human crowd using 10% of point annotations. Table S1. The comparison of different losses between density-map-based approaches and our method. Table S2. The summary of different methods related to this study. Table S3. The structure and split of each dataset used in this paper. Table S4. Benchmark of models: parameters, size, and methodology.

Author Contributions

Conceptualization, X.Z., S.W., and Z.C.; methodology, X.Z., S.W., and Z.C.; software, X.Z., S.W. and W.H.; validation, S.W., W.H., and Z.X.; formal analysis, X.Z. and S.W.; investigation, Z.P. and Z.C.; resources, X.Z. and Z.C.; data curation, X.Z. and W.H.; writing—original draft preparation, X.Z.; writing—review and editing, Z.C.; visualization, X.Z.; supervision, Z.J. and Z.C.; funding acquisition, X.Z. and Z.C.; project administration, Z.J. and Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Beijing Natural Science Foundation (Grant No. 3242008), National Natural Science Foundation of China (Grant No. 62571293 and No. 62103228), China Postdoctoral Science Foundation (Grant No. 2023M741850), and Young Elite Scientists Sponsorship Program by CAST (Grant No. 2023QNRC001).

Data Availability Statement

The original contributions presented in this study are included in the article and Supplementary Materials. Further inquiries can be directed to the corresponding author. The code is available on GitHub at the following link https://github.com/wangshihong/LFE (accessed on 20 October 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Sirinukunwattana, K.; Raza, S.E.A.; Tsang, Y.W.; Snead, D.R.J.; Cree, I.A.; Rajpoot, N.M. Locality Sensitive Deep Learning for Detection and Classification of Nuclei in Routine Colon Cancer Histology Images. IEEE Trans. Med. Imaging 2016, 35, 1196–1206.
2. Hou, L.; Nguyen, V.; Kanevsky, A.B.; Samaras, D.; Kurc, T.M.; Zhao, T.H.; Gupta, R.R.; Gao, Y.; Chen, W.J.; Foran, D.; et al. Sparse autoencoder for unsupervised nucleus detection and representation in histopathology images. Pattern Recogn. 2019, 86, 188–200.
3. He, S.H.; Minn, K.T.; Solnica-Krezel, L.; Anastasio, M.A.; Li, H. Deeply-supervised density regression for automatic cell counting in microscopy images. Med. Image Anal. 2021, 68, 101892.
4. Hagos, Y.B.; Narayanan, P.L.; Akarca, A.U.; Marafioti, T.; Yuan, Y.Y. ConCORDe-Net: Cell Count Regularized Convolutional Neural Network for Cell Detection in Multiplex Immunohistochemistry Images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2019, Shenzhen, China, 13–17 October 2019; Volume 11764, pp. 667–675.
5. Zhou, X.; Gu, M.; Cheng, Z. Local integral regression network for cell nuclei detection. Entropy 2021, 23, 1336.
6. Zhou, X.; Mao, Y.; Gu, M.; Cheng, Z. WSCNet: Biomedical Image Recognition for Cell Encapsulated Microfluidic Droplets. Biosensors 2023, 13, 821.
7. Mao, Y.; Zhou, X.; Hu, W.; Cheng, Z. Dynamic video recognition for cell encapsulated microfluidic droplets. Analyst 2024, 149, 2147–2160.
8. Mao, Y.; Zhou, X.; Hu, W.; Yang, W.; Cheng, Z. Adaptive sparse pairwise loss for object re-identification. In Proceedings of the 2023 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19691–19701.
9. Xiong, B.; Su, C.; Lin, Z.; Chen, Y.; Zhou, Y.; Cheng, Z.; Yu, Z.; Huang, T. Real-time parameter evaluation of high-speed microfluidic droplets using continuous spike streams. In Proceedings of the 2024 ACM International Conference on Multimedia (MM’ 24), Melbourne, Australia, 15–29 October 2024; pp. 6833–6841.
10. Ma, Z.; Yu, L.; Chan, A.B. Small Instance Detection by Integer Programming on Object Density Maps. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3689–3697.
11. Liu, C.C.; Weng, X.Y.; Mu, Y.D. Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1217–1226.
12. Liu, W.Z.; Salzmann, M.; Fua, P. Context-Aware Crowd Counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5094–5103.
13. Ma, Z.H.; Wei, X.; Hong, X.P.; Gong, Y.H. Bayesian Loss for Crowd Count Estimation with Point Supervision. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6141–6150.
14. Zhou, X.Q.; Zou, Y.X.; Wang, Y. Accurate Small Object Detection Via Density Map Aided Saliency Estimation. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 425–429.
15. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; pp. 544–559.
16. Zhou, Y.N.; Dou, Q.; Chen, H.; Qin, J.; Heng, P.A. SFCN-OPI: Detection and Fine-Grained Classification of Nuclei Using Sibling FCN with Objectness Prior Interaction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2652–2659.
17. Zhou, X.; Cheng, Z.; Gu, M.; Chang, F. LIRNet: Local Integral Regression Network for Both Strongly and Weakly Supervised Nuclei Detection. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020; pp. 945–951.
18. Sam, D.B.; Babu, R.V. Top-Down Feedback for Crowd Counting Convolutional Neural Network. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 7323–7330.
19. Kiryo, R.; Niu, G.; Plessis, M.C.d.; Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1674–1684.
20. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.E.; McGuinness, K. Unsupervised Label Noise Modeling and Loss Correction. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97.
21. Xie, Y.P.; Xing, F.Y.; Kong, X.F.; Su, H.; Yang, L. Beyond Classification: Structured Regression for Robust Cell Detection Using Convolutional Neural Network. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 358–365.
22. Xie, Y.P.; Xing, F.Y.; Shi, X.S.; Kong, X.F.; Su, H.; Yang, L. Efficient and robust cell detection: A structured regression approach. Med. Image Anal. 2018, 44, 245–254.
23. Zhang, Y.Y.; Zhou, D.S.; Chen, S.Q.; Gao, S.H.; Ma, Y. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597.
24. Li, Y.H.; Zhang, X.F.; Chen, D.M. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100.
25. Cao, X.K.; Wang, Z.P.; Zhao, Y.Y.; Su, F. Scale Aggregation Network for Accurate and Efficient Crowd Counting. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; pp. 757–773.
26. Jiang, X.; Xiao, Z.; Zhang, B.; Zhen, X.; Cao, X.; Doermann, D.; Shao, L. Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6133–6142.
27. Zhu, R.K.; Sui, D.; Qin, H.; Hao, A.M. An Extended Type Cell Detection and Counting Method based on FCN. In Proceedings of the 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), Washington, DC, USA, 23–25 October 2017; pp. 51–56.
28. Khan, A.; Gould, S.; Salzmann, M. Deep convolutional neural networks for human embryonic cell counting. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 339–348.
29. Raza, S.E.A.; AbdulJabbar, K.; Jamal-Hanjani, M.; Veeriah, S.; Le Quesne, J.; Swanton, C.; Yuan, Y.Y. Deconvolving Convolutional Neural Network for Cell Detection. In Proceedings of the IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 891–894.
30. Xie, W.; Noble, J.A.; Zisserman, A. Microscopy cell counting and detection with fully convolutional regression networks. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2018, 6, 283–292.
31. Xu, J.; Xiang, L.; Liu, Q.S.; Gilmore, H.; Wu, J.Z.; Tang, J.H.; Madabhushi, A. Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images. IEEE Trans. Med. Imaging 2016, 35, 119–130.
32. Elkan, C.; Noto, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 213–220.
33. Hido, S.; Tsuboi, Y.; Kashima, H.; Sugiyama, M.; Kanamori, T. Inlier-based Outlier Detection via Direct Density Ratio Estimation. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 223–232.
34. Li, X.; Liu, B. Learning to classify texts using positive and unlabeled data. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003; pp. 587–592.
35. Bing, L.; Yang, D.; Li, X.L.; Lee, W.S.; Yu, P.S. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, FL, USA, 19–22 November 2003; pp. 179–186.
36. Sansone, E.; De Natale, F.G.B.; Zhou, Z.H. Efficient Training for Positive Unlabeled Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2584–2598.
37. Wang, Z.; Long, G. Positive Unlabeled Learning by Sample Selection and Prototype Refinement. In Proceedings of the Advanced Data Mining and Applications, Queensland, Australia, 28–30 November 2022; pp. 304–318.
38. Cohen, J.P.; Boucher, G.; Glastonbury, C.A.; Lo, H.Z.; Bengio, Y. Count-ception: Counting by Fully Convolutional Redundant Counting. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW 2017), Venice, Italy, 22–29 October 2017; pp. 18–26.
39. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; Volume 15, pp. 315–323.
40. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
41. Sam, D.B.; Peri, S.V.; Sundararaman, M.N.; Kamath, A.; Babu, R.V. Locate, Size, and Count: Accurately Resolving People in Dense Crowds via Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2739–2751.
Figure 1. Study motivation and the proposed minimal “redundant count map”. (A) The network was designed to learn and localize small objects with limited annotation examples. Green crosses indicate the center of each target. (B) Illustrations of the localization map and its corresponding minimal “redundant count map”. The different colors correspond to different 3 × 3 square-kernel positions (left) and their “redundant count map” (right).
Figure 2. Network architecture and training procedure of the proposed method for small object localization. In the first stage, both positive examples and unlabeled data are first fed into an encoder-decoder network, which outputs predicted location maps of small objects. Afterward, a predicted redundant count map is generated from the predicted location map and then shapes the loss function for the PU learning process. Green crosses indicate the center of each annotation. In the second stage, pseudo-labeling is initialized by the best model in the PU learning procedure. During the pseudo-labeling stage, the negative loss of the unlabeled region in the training images is fitted as a beta mixture model (BMM), and pseudo-labels are produced to perform ordinary binary classification.
Figure 3. Principle and workflow of the pseudo-labeling procedure. Local maxima in the predicted location map are signified by dots of varying saturation. Squares with different colors were used to represent non-zero count blocks in the corresponding predicted count map.
Figure 4. Representative predicted location maps in instance-dense regions. (A) From top to bottom: original image, ground-truth location map, and predicted location map. (B) Typical results of the LFE framework for nuclei detection on the Colorectal Adenocarcinoma (CA) cell datasets. (C) Typical results of the LFE framework for nuclei detection on the Modified Bone Marrow (MBM) cell datasets. The yellow circles and the red dots correspond to ground-truth and the predicted centers, respectively. For optimal clarity, please refer to the color and zoomed versions of the figures.
Figure 5. Representative localization results on test datasets. Models were trained with fewer than 10% of point annotations: (A) honeybee; (B) fish; (C,D) human crowd. Yellow circles and red dots denote the ground-truth and predicted object centers, respectively.
Figure 6. Comparison of localization performance with and without pseudo-labeling. Quantitative results across five datasets: (A) CA cells; (B) MBM cells; (C) honeybee; (D) fish; (E) human crowd. Performance is evaluated using the F1 score (“↑”, higher is better) and the Mean Localization Error (MLE; “↓”, lower is better).
Table 1. The number of point annotations used in different datasets.

| Dataset | Targets per Image | Number of Point Annotations |
|---|---|---|
| CA cells | 298 ± 217 | 100, 50, 25, 10 |
| MBM cells | 126 ± 33 | 50, 25, 10, 5 |
| Fish | 56 ± 9 | 25, 15, 10, 5 |
| Honeybee | 28 ± 6 | 15, 10, 5, 3 |
| ST Part B | 123 ± 94 | 100, 50, 25, 12 |
Table 2. The quantitative results of local maxima in different datasets.

| Dataset | Kernel Size | Local Maximum | Mean Proportion | Gaussian Center |
|---|---|---|---|---|
| CA | 7 | 0.24 ± 0.06 | 0.27 ± 0.05 | 0.12 |
| MBM | 11 | 0.22 ± 0.05 | 0.23 ± 0.05 | 0.05 |
| Fish | 9 | 0.20 ± 0.06 | 0.21 ± 0.06 | 0.07 |
| Honeybee | 15 | 0.16 ± 0.06 | 0.19 ± 0.06 | 0.03 |
| ST Part B | 17 | 0.13 ± 0.02 | 0.15 ± 0.01 | 0.02 |
