Article

Cost-Aware Active Learning Framework for Efficient Small-Object Detection in Agricultural Images

1 Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
2 University Department of Professional Studies, University of Split, 21000 Split, Croatia
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(6), 1196; https://doi.org/10.3390/electronics15061196
Submission received: 15 February 2026 / Revised: 9 March 2026 / Accepted: 10 March 2026 / Published: 13 March 2026

Abstract

Although active learning can reduce the effort required to annotate object detection data, many current methods rely on a single selection criterion or combine criteria without accounting for annotation costs or their interactions. This paper presents a multi-criterion, cost-aware active learning framework for detecting small objects in agricultural images. The framework jointly considers prediction uncertainty, object size, scene density, and annotation cost. We evaluate both scalarized and Pareto-based selection strategies across five cost models and conduct an ablation study to examine the role and interactions of each criterion. Experimental results demonstrate that explicit annotation cost modeling improves active learning efficiency by reducing the amount of annotation required to achieve a given level of detection performance. Across multiple cost formulations and selection strategies, cost-aware acquisition reaches accuracy comparable to random sampling while reducing the estimated annotation effort required to do so by up to 50%, where annotation effort is approximated using prediction-derived cost proxies.

1. Introduction

Active learning (AL) has been widely studied as a means of reducing annotation effort by iteratively selecting informative samples for labelling rather than annotating large datasets exhaustively. Early AL methods were developed for classification and relied primarily on uncertainty-based acquisition criteria such as entropy or margin sampling [1]. With the rise of deep learning, active learning has been extended to large-scale visual recognition tasks, where uncertainty-based, diversity-based, and hybrid strategies have been proposed [2,3,4,5]. While uncertainty-based methods prioritise ambiguous samples, diversity-based approaches aim to reduce redundancy by selecting representative subsets of the data.
Applying active learning to object detection is more challenging than to classification because detection outputs are structured and combine localisation and classification. Existing approaches typically adapt uncertainty measures by aggregating confidence scores across predicted bounding boxes or by modelling localisation uncertainty [6,7,8]. Hybrid methods that combine uncertainty with diversity or representativeness have also been explored to mitigate redundancy [9,10,11]. Despite promising results, most active learning methods for object detection rely on a limited set of acquisition signals and implicitly assume uniform annotation cost across images.
In agricultural vision, active learning has been applied to tasks such as plant phenotyping, disease recognition, and fruit detection, demonstrating reductions in labelling effort compared to random sampling [12,13,14]. However, applications to object detection remain comparatively limited and often rely on single-criterion acquisition functions. Moreover, domain-specific challenges such as small object size, high object density, and non-uniform annotation cost are rarely addressed explicitly, despite their strong influence on both detection performance and labelling effort in orchard imagery.
Recent work has highlighted the importance of cost-aware active learning, recognising that annotation effort varies substantially across samples and should be explicitly modelled [15,16]. In object detection, early cost-aware approaches measured effort by the number of bounding boxes or regions requiring annotation. More recent studies have shown that cost-aware selection can improve annotation efficiency by balancing information gain against labelling burden. Nevertheless, most cost-aware approaches focus on a single notion of informativeness and do not jointly model multiple sources of detection difficulty, such as uncertainty, object scale, and scene density.
In our previous work [17], we presented an olive detection pipeline that integrates a YOLOv8-based active learning approach to evaluate acquisition strategies based on missed detections, prediction confidence, and object size. Although the experiments show that the implemented strategies significantly improved detection accuracy and reduced labelling effort, each is strongly focused on a specific aspect of the detection problem. These findings motivate the present study.
Motivated by recent studies that combine multiple acquisition criteria to accelerate active learning [18,19], this work formulates olive detection in orchard imagery as a multi-criteria optimization problem that explicitly accounts for annotation cost. Instead of selecting samples based on a single heuristic, this method combines uncertainty, object scale, and scene density within a unified optimization framework that simultaneously considers diverse selection parameters. Various annotation-cost models were evaluated using different selection strategies to assess their effects on the ranking of selected samples and the importance of specific parameters. Experimental results show that explicit cost modelling regularizes sample-selection behaviour by suppressing extreme acquisition patterns, as evidenced by reduced annotation effort and convergent detection performance across distinct selection sets. At the same time, the benefits of informed selection and the influence of other active learning parameters remain visible. Evidence for this is provided by Jaccard plots, which show that different selection patterns conditioned by the annotation cost model yield comparable performance. Moreover, simple scalar combinations of normalized criteria often achieve performance comparable to that of Pareto-based selection and distributed weighting schemes across several cost models, suggesting that cost formulation and criterion interaction can be as influential as the choice of optimization strategy in this context.
The main contributions of this work are as follows:
  • An active learning framework is proposed that selects images using multiple criteria, including prediction uncertainty, object size, scene density, and annotation cost. This design directly addresses the challenges of detecting small and densely packed objects in olive orchard imagery.
  • Several annotation cost models and acquisition strategies are evaluated through experiments. These include both scalarized selection and Pareto-based selection in an iterative active learning setting.
  • Ablation analysis is used to study the role of individual selection criteria. The results show that the importance of uncertainty, object scale, and density depends on the assumed cost model, with scene density consistently contributing to learning efficiency.
  • Selection behaviour is analyzed using Jaccard overlap. The results show that different acquisition strategies can select very different image sets while achieving similar detection performance, highlighting the importance of how selection criteria interact.
  • The study demonstrates that explicit modelling of annotation cost and careful design of selection criteria strongly influence active learning behaviour and annotation efficiency. At the same time, different optimization strategies can achieve comparable detection performance despite selecting different subsets of training images.

2. Methods

2.1. Problem Formulation

This work studies active learning for object detection in olive orchard images. The focus is on detecting small objects under limited annotation budgets. The dataset is divided into a labelled set and an unlabelled pool. At each active learning iteration, an object detector is trained using the currently labelled data. The goal is to select a subset of unlabelled images for annotation that improves detection performance while accounting for the amount and cost of new labels.
Unlike standard active learning methods that rely on a single selection criterion, the proposed approach formulates sample selection as a multi-criteria optimisation problem. It jointly considers model uncertainty, object scale, scene density, and annotation cost. This formulation is well-suited to agricultural imagery, where objects are often small, densely distributed, and expensive to annotate. Annotation cost is estimated using a proxy computed from model predictions, since the true annotation effort is not known at selection time.
Because these criteria may conflict with one another, the selection problem is inherently multi-objective and cannot be optimised using a single objective without approximation. For this reason, the study evaluates both scalarised selection methods and Pareto-based selection as practical approximations to the underlying multi-objective problem.

2.2. Database Used for Evaluation

2.2.1. Split Olive Dataset v1

The experiments were conducted on the Split Olive Dataset v1, described in detail in [17,20]. The original dataset consists of 91 high-resolution orchard images acquired under real field conditions and includes multiple olive varieties. To increase the number of training samples and to standardize detector input, each original image was tiled into patches of 640 × 640 pixels, resulting in a total of 1038 image tiles. All tiles were annotated with axis-aligned bounding boxes for a single class (olive fruit).
The validation and test subsets remain fixed throughout the entire active learning procedure and are never queried by any acquisition strategy. Ground-truth annotations are available only for the initial labelled batch and are revealed for pool images only after selection, following an oracle-based active learning setting.
Across all splits, the dataset contains 11,227 annotated olive instances. The training data comprise 7821 objects distributed across 861 images, while the validation split contains 2217 annotated objects. The mean bounding-box size is 28.224 px × 30.208 px, highlighting the small-object nature of olive detection in orchard imagery.
This partitioning protocol directly reflects the implemented pipeline logic, namely the bootstrap procedure described in detail in Section 2.7.

2.2.2. Tacna Olive Dataset

The Tacna Olive Dataset originally consists of 503 high-resolution images of 62 Sevillana olive trees captured at a resolution of 6000 × 4000 pixels. In the original dataset release, the images were tiled into 12,072 patches of size 1000 × 1000 pixels, which were subsequently annotated and split into training (8549 images), validation (2179 images), and test (1344 images) sets [21].
For the purposes of this study, an additional tiling step was applied to match the detector input resolution used in the active learning pipeline. Specifically, the 1000 × 1000 patches were further divided into tiles of 640 × 640 pixels, which corresponds to the fixed input resolution used for YOLOv8 training (imgsz = 640). This additional tiling increases the number of training samples while maintaining the original dataset split between training, validation, and test subsets.
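As an illustration of this preprocessing step, the sketch below performs non-overlapping tiling of an image array into fixed-size patches. The paper does not specify how border regions narrower than a tile are handled, so dropping incomplete tiles here is an assumption made for simplicity.

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 640) -> list:
    """Split an H x W x C image into non-overlapping tile x tile patches.

    Patches that would extend past the image border are dropped; this
    border-handling convention is an assumption, since the paper leaves
    it unspecified (overlapping or padded tiling would yield more tiles).
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patches.append(image[y:y + tile, x:x + tile])
    return patches
```

Under this convention, a 1280 × 1280 input yields four 640 × 640 tiles, while a 1000 × 1000 patch yields only one; an overlapping scheme would be needed to extract more training samples per patch.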
Images in the Tacna dataset were captured using a Canon EOS Rebel T6i camera [21] equipped with a CMOS sensor (22.3 mm × 14.9 mm) and a resolution of 24.2 MP. Photographs were taken between 10:00 and 17:00 h, with eight images captured per tree from different viewpoints.
The dataset contains 245,089 annotated olive instances, distributed as 167,504 objects in the training set, 41,428 in the validation set, and 36,157 in the test set. Olives were labeled using a semi-automatic annotation procedure described in [21]. The average bounding-box size is 25.3 px × 30.5 px, confirming the small-object detection characteristics of the dataset.

2.3. Multi-Criteria Selection Signals

For each unlabelled image $x \in D_U^{(t)}$, a set of selection criteria derived from the current detector's predictions is computed. These criteria capture complementary aspects of informativeness and annotation effort. To reflect the various sources of detection difficulty and labelling effort in orchard imagery, four complementary selection criteria are defined, measuring prediction uncertainty, small-object relevance, scene density, and annotation cost.
Model uncertainty is estimated using the confidence scores of predicted detections. Given an image $x$ with $N_x$ predicted objects, each associated with a confidence score $p_k \in [0, 1]$, the uncertainty score is defined as the mean inverse confidence across predicted detections:

$$U(x) = \frac{1}{N_x} \sum_{k=1}^{N_x} (1 - p_k). \quad (1)$$
This formulation assigns higher uncertainty to images containing many low-confidence detections, which are likely to yield informative supervision upon annotation.
To explicitly prioritize images containing small target objects, we introduce a scale-aware criterion based on predicted bounding box areas. Let $a_k$ denote the area of detection $k$ in image $x$. We define a normalized small-object relevance score as

$$S(x) = \frac{1}{N_x} \sum_{k=1}^{N_x} \left(1 - \frac{a_k}{A_{max}^{(t)}}\right), \quad (2)$$

where $A_{max}^{(t)} = \max_{x \in D_U^{(t)}} \max_{k \in x} a_k$ is the maximum bounding box area observed across the unlabelled pool at iteration $t$. This normalization ensures scale invariance across images and favours samples containing smaller objects, which are typically more difficult to detect in orchard environments.
Images with multiple objects often provide richer supervision signals. We therefore incorporate a density-based criterion defined as
$$D(x) = \log(1 + N_x), \quad (3)$$
which increases with the number of predicted objects while avoiding excessive emphasis on extremely dense scenes.
To account for the human effort required for labelling, we incorporate an annotation cost signal that penalizes images with high annotation workload. Cost is estimated using a proxy derived from model predictions, reflecting factors such as object count and annotation volume. This signal acts as a regularizing term during sample selection, discouraging the acquisition of disproportionately expensive images. The specific cost formulations evaluated in this study are detailed in Section 2.7.2.
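To make the four signals concrete, the following sketch computes them from a detector's raw per-image outputs (confidence scores and box areas), using the product cost proxy $C = N_x \bar{a}_x$ as one of the formulations detailed in Section 2.7.2. Treating an image with no detections as uninformative (all-zero signals) is an assumption; the paper does not state how empty predictions are handled.

```python
import math

def selection_signals(confidences, areas, a_max):
    """Per-image selection criteria from detector predictions (Section 2.3).

    confidences: predicted confidence scores p_k in [0, 1]
    areas:       predicted bounding-box areas a_k (pixels^2)
    a_max:       maximum box area over the unlabelled pool at this iteration

    Returns (U, S, D, C): uncertainty, small-object relevance, density,
    and the product cost proxy C = N * mean_area.
    """
    n = len(confidences)
    if n == 0:
        # Assumption: images without detections contribute no signal.
        return 0.0, 0.0, 0.0, 0.0
    u = sum(1.0 - p for p in confidences) / n      # mean inverse confidence
    s = sum(1.0 - a / a_max for a in areas) / n    # favours small boxes
    d = math.log(1.0 + n)                          # log-damped object count
    c = n * (sum(areas) / n)                       # cost proxy: count * mean area
    return u, s, d, c
```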

2.4. Unified Acquisition Function

The active learning problem is formulated as a multi-objective optimization task. For each image $x \in D_U^{(t)}$, we seek to maximize informativeness, expressed through uncertainty, small-object relevance, and density, while simultaneously minimizing annotation cost. Formally, this can be written as

$$\max \left(U(x), S(x), D(x)\right), \quad \min C(x), \quad (4)$$
subject to a labelling budget constraint. The proposed scoring function can be interpreted as a weighted scalarization of an underlying Pareto optimization problem, enabling efficient ranking of unlabelled samples while approximating the trade-offs between competing objectives.
In practice, solving a full Pareto optimization problem at each iteration can be computationally expensive. Therefore, a scalarized approximation that combines the criteria into a single ranking score is adopted:
$$\mathrm{Score}_t(x) = \frac{w_U U(x) + w_S S(x) + w_D D(x)}{1 + w_C C(x)}, \quad (5)$$

where $w_U$, $w_S$, $w_D$, and $w_C$ are non-negative hyperparameters that control the relative importance of each criterion. Images are ranked according to this score, and the top $B_t$ samples are selected for annotation.
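A minimal sketch of this scalarized ranking is given below, assuming the criteria have already been normalized to comparable ranges (the paper normalizes criteria but does not detail the scheme) and equal unit weights by default.

```python
def scalarized_score(u, s, d, c, w_u=1.0, w_s=1.0, w_d=1.0, w_c=1.0):
    """Scalarized acquisition score: weighted informativeness divided by a
    cost penalty, following Score_t(x) = (wU*U + wS*S + wD*D) / (1 + wC*C)."""
    return (w_u * u + w_s * s + w_d * d) / (1.0 + w_c * c)

def select_top_b(signals, b):
    """Rank pool images by score and return the indices of the top-b.

    signals: list of (U, S, D, C) tuples, one per unlabelled image,
             assumed already normalized to comparable ranges.
    """
    ranked = sorted(range(len(signals)),
                    key=lambda i: scalarized_score(*signals[i]),
                    reverse=True)
    return ranked[:b]
```

A cheap, informative image (high U, S, D; low C) therefore outranks an equally informative but expensive one, which is exactly the regularizing effect the cost term is meant to provide.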

2.5. Pareto-Based Selection Strategy

In addition to the scalarized acquisition function defined in Equation (5), we also evaluate a Pareto-based selection strategy that treats sample acquisition as a multi-objective optimization problem. Instead of combining the criteria into a single score using predefined weights, Pareto ranking is performed using non-dominated sorting, where candidate samples are grouped into successive Pareto fronts according to dominance relations [22].
For each image $x$ in the unlabelled pool $D_U^{(t)}$, a four-dimensional objective vector is constructed:
(U(x), S(x), D(x), −C(x))
where uncertainty U(x), scale relevance S(x) and scene density D(x) are maximized, while annotation cost C(x) is minimized.
An image $x_i$ is said to Pareto-dominate another image $x_j$ if it is no worse in all objectives and strictly better in at least one objective.
All non-dominated samples form the first Pareto front. After removing these samples, the procedure is repeated to construct subsequent fronts until all candidates are ranked.
To obtain an ordering within each Pareto front, non-dominated sorting is followed by crowding-distance selection when only a subset of samples from the current front can be included in the batch. This favours candidates located in less crowded regions of the objective space and preserves diversity among selected samples.
Compared with scalarized ranking, the Pareto-based approach does not require manual specification of trade-off weights between criteria. Instead, it preserves the multi-objective structure of the acquisition problem and allows multiple candidate samples with different trade-offs between informativeness and annotation cost to be considered simultaneously.
This property is particularly useful in active learning scenarios where the relative importance of uncertainty, object scale, scene density, and annotation cost may vary across learning stages or datasets.
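The dominance test, front construction, and crowding-distance tie-breaking described above can be sketched as follows. Objectives are oriented for maximization, so annotation cost enters as $-C(x)$; this is a plain non-dominated-sorting sketch in the spirit of NSGA-II [22], not the authors' exact implementation.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b: no worse in every
    objective and strictly better in at least one (all maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_fronts(objectives):
    """Group candidate indices into successive non-dominated fronts."""
    remaining = set(range(len(objectives)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objectives[j], objectives[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(front, objectives):
    """Crowding distance within one front (larger = less crowded);
    boundary candidates in each objective receive infinite distance."""
    dist = {i: 0.0 for i in front}
    n_obj = len(objectives[front[0]])
    for k in range(n_obj):
        ordered = sorted(front, key=lambda i: objectives[i][k])
        lo, hi = objectives[ordered[0]][k], objectives[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objectives[nxt][k] - objectives[prev][k]) / (hi - lo)
    return dist
```

To fill a batch, fronts are consumed in order; when only part of a front fits, its members with the largest crowding distance are taken first, preserving diversity in objective space.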
Considering both approaches as complementary approximations to the underlying multi-objective selection problem, their behaviour was thoroughly evaluated across multiple annotation cost formulations, with the comparative analysis detailed in the experimental section.

2.6. Iterative Active Learning Procedure

At each active learning iteration, the detector is trained on the current labelled set, predictions are generated for the unlabelled pool, and selection criteria are computed for all candidate images. A subset of images is then selected according to the multi-criteria ranking rule and annotated by human experts. The newly labelled samples are added to the training set, removed from the unlabelled pool, and the detector is retrained. This process is repeated until the labelling budget is exhausted or performance improvements fall below a predefined threshold. Algorithm 1 summarizes these steps.
By jointly optimizing multiple selection objectives and explicitly accounting for annotation cost, the proposed method aims to improve learning efficiency and robustness, particularly in the challenging setting of small-object detection in agricultural imagery.
To analyse the contribution of individual selection criteria and their interactions, a systematic ablation of the unified acquisition function is further considered. Therefore, one criterion weight is set to zero, while the remaining criteria are kept unchanged, thereby isolating the functional roles of uncertainty, scale, density, and cost within the multi-criteria selection framework. This ablation procedure is applied consistently across optimization strategies and cost models and serves both as an analytical tool and as a basis for designing adaptive weighting schemes evaluated in Section 3.
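The leave-one-criterion-out ablation described above amounts to generating weight configurations in which one weight is zeroed at a time. A minimal sketch, assuming equal unit weights as the base configuration (the paper's actual base weights may differ):

```python
def ablation_weight_sets(base=None):
    """Weight configurations for the ablation study: starting from a base
    setting, zero out one criterion weight at a time, keeping the rest.

    Returns a dict mapping a config name to its {w_U, w_S, w_D, w_C} weights.
    """
    names = ("w_U", "w_S", "w_D", "w_C")
    base = dict(base) if base else {n: 1.0 for n in names}
    configs = {"full": dict(base)}
    for n in names:
        cfg = dict(base)
        cfg[n] = 0.0               # drop exactly one criterion
        configs[f"no_{n}"] = cfg
    return configs
```

Each configuration is then run through the same acquisition pipeline, so any performance gap relative to the full setting isolates the contribution of the removed criterion.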
Algorithm 1: Multi-Criteria Cost-Aware Active Learning
Input:
 Fully annotated dataset D
 Initial batch size B1
 Batch increment B
 Max iterations T
 Cost formulation C(x)
 Weight schedule W_t
1: Partition D into:
    -BATCH1 (initial labeled set)
    -POOL
    -VALIDATION
    -TEST
2: Train detector on BATCH1
3: For t = 1 to T:
4:   Predict detections on POOL
5:   For each image x in POOL:
6:     Compute U(x), S(x), D(x)
7:     Compute cost C(x)
8:     If scalar:
9:       Compute Score_t(x)
10:    If Pareto:
11:      Perform non-dominated sorting
12:  Select top B images
13:  Reveal ground-truth labels
14:  Add to training set
15:  Remove from POOL
16:  Retrain detector
17: End For

2.7. Experimental Setup

In this study, an iterative active learning procedure for single-class small-object detection in orchard imagery is analysed, focusing on olive detection under limited annotation budgets. All experiments follow a fixed data partitioning protocol to ensure fair comparison across acquisition strategies.
Starting from a fully annotated dataset, labelled in Figure 1 as FULLY ANNOTATED DATASET, images are partitioned once into four disjoint subsets: an initial labelled training set (BATCH1), an unlabelled pool from which samples are queried during active learning, a validation set used for model selection during training, and a held-out test set used exclusively for evaluation. The test set is never queried during active learning.
Performance is evaluated after each iteration using mAP50 and mAP50–95 on the held-out test set.
At each iteration $t$, the detector is trained on the accumulated training set, and the validation set is used for model selection. The selected model is then applied to the unlabelled pool, from which a new batch is selected and added to the training set. The flowchart in Figure 1 illustrates this procedure.
To establish a clear experimental reference for evaluating different acquisition strategies, two baseline training regimes are first considered.
  • Full Supervision, under which the detector is trained once using the complete labelled training set. This setting serves as an upper bound on achievable performance given the available annotations.
  • Random Sampling, a baseline active learning strategy in which images are selected uniformly at random from the unlabelled pool at each iteration. This baseline isolates the effect of the iterative training process itself, without relying on any informed acquisition criterion; in other words, it reflects the performance improvement achievable purely by gradually increasing the training set size during the active learning cycle. To ensure comparability, the same initial training set (BATCH1), obtained during the bootstrap procedure, is used in all experiments.
Within the active learning framework, these reference regimes are compared with several acquisition strategies, including uncertainty-based sampling, scalarized multi-criteria selection, and Pareto-based multi-objective selection.

2.7.1. Detector Architecture and Training Details

The active learning framework is detector-agnostic. In all experiments, it is instantiated with the YOLOv8 framework [23], chosen for its strong speed–accuracy trade-off and its suitability for detecting small, densely clustered objects.
In the first iteration, training begins with a YOLOv8 initialization (using pretrained weights or the prior best checkpoint, depending on the pipeline setting). In subsequent iterations, the best checkpoint from the previous iteration is fine-tuned (i.e., the most recent best.pt), with the “continue training” logic already implemented in the pipeline. This stabilizes learning across iterations and reflects realistic annotation workflows where improvements are incremental.
During training, a fixed image resolution and batch size are used (imgsz = 640, batch = 16), along with a fixed number of epochs per iteration (EPOCHS = 15). Learning rate settings follow a dual pattern: an initial training rate for the bootstrap model and a lower fine-tuning rate for subsequent iterations, aligned with standard active learning setups [1].
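The per-iteration training configuration above can be captured in a short sketch using the ultralytics API. The dataset YAML path, checkpoint path, and learning rate values are hypothetical placeholders; only the imgsz, batch, and epochs values come from the paper.

```python
# Fixed per-iteration hyperparameters reported in Section 2.7.1.
TRAIN_CFG = dict(imgsz=640, batch=16, epochs=15)

def train_iteration(weights_path, data_yaml, lr0):
    """Fine-tune a checkpoint for one active learning iteration.

    weights_path: pretrained weights for the bootstrap iteration
                  (e.g. 'yolov8n.pt') or the previous iteration's best.pt;
                  both paths here are illustrative, not the authors' exact ones.
    lr0:          initial learning rate; the paper uses a higher rate for
                  the bootstrap model and a lower fine-tuning rate afterwards.
    """
    from ultralytics import YOLO  # requires the ultralytics package
    model = YOLO(weights_path)
    model.train(data=data_yaml, lr0=lr0, **TRAIN_CFG)
    return model
```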

2.7.2. Annotation Cost Modeling and Empirical Cost Measurement

To account for the non-uniform effort required to annotate orchard imagery, the proposed acquisition function incorporates an annotation cost proxy that is available at selection time. For each unlabelled image $x$, one of the cost formulations is estimated as

$$C(x) = N_x \, \bar{a}_x, \quad (6)$$

where $N_x$ denotes the number of predicted objects in the image and $\bar{a}_x$ is the mean predicted bounding-box area. This formulation reflects the overall annotation workload by penalizing images with many objects and larger annotation volumes.
To analyse the sensitivity of active learning behaviour to different notions of annotation cost, we additionally evaluate three alternative cost proxies as ablation settings. A density-only formulation,
$$C(x) = N_x, \quad (7)$$
models annotation effort solely as a function of object count, while an inverse-area formulation,
$$C(x) = \frac{N_x}{\bar{a}_x + \varepsilon}, \quad (8)$$

penalizes images dominated by many small objects, which are often more difficult to annotate precisely. Here $\varepsilon > 0$ is a small constant introduced to prevent numerical instability when the mean bounding-box area approaches zero; it was fixed at $\varepsilon = 10^{-6}$ in all experiments and was not treated as a tunable hyperparameter. In addition, a random fixed-cost model is evaluated, in which each image is assigned a random cost that remains constant across iterations and is independent of image content or model predictions. This setting serves as a control condition to assess whether improvements attributed to cost-aware selection arise from meaningful cost structure or merely from regularization effects introduced by an additional criterion.
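The cost formulations of Equations (6)-(8), plus the random fixed-cost control, can be expressed in one small dispatcher. The mode names are illustrative labels, not identifiers from the authors' code.

```python
import random

def cost_proxy(n_boxes, mean_area, mode="product", eps=1e-6, rng=None):
    """Annotation-cost proxies evaluated in the paper (Section 2.7.2).

    'product': C = N * mean_area         (Eq. 6)
    'count':   C = N                     (Eq. 7, density-only)
    'inverse': C = N / (mean_area + eps) (Eq. 8, penalizes many small boxes)
    'random':  content-independent cost  (control; fixed per image in the
               paper, so the caller should draw it once and cache it)
    """
    if mode == "product":
        return n_boxes * mean_area
    if mode == "count":
        return float(n_boxes)
    if mode == "inverse":
        return n_boxes / (mean_area + eps)
    if mode == "random":
        return (rng or random).random()
    raise ValueError(f"unknown cost mode: {mode}")
```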
The annotation cost proxy is used only to rank images during each active learning iteration. It is not used to estimate the actual time needed for annotation. In this work, annotation effort therefore denotes a proxy quantity derived from prediction-based scene statistics rather than directly measured human annotation time. Direct measurement of annotation time was outside the scope of the present study. Instead, annotation cost was approximated using prediction-derived proxies that correlate with annotation workload, such as object count and bounding-box area. Although the proposed proxies cannot fully capture human cognitive annotation effort, they provide a practical approximation of the annotation workload that is available during the sample selection stage.
The proposed framework can easily be extended in future studies to include real measurements of annotation effort and to analyse the relationship between cost proxies and actual human effort.

2.7.3. Validation of Prediction-Derived Annotation Proxies

In this paper, the annotation cost and scene statistics are evaluated using cost proxies rather than ground-truth values. Given that these proxies influence the ranking strategy, it is important to check whether they correlate with the ground-truth image features.
To evaluate this relationship, prediction-based statistics were computed and compared with ground-truth values collected from the bootstrapped training set. In particular, the agreement between predicted and ground-truth object counts, and between predicted and ground-truth mean bounding-box areas, was quantified using Pearson and Spearman correlation coefficients (Figure 2).
The results (Table 1) indicate a strong correlation between the number of predicted and actual objects, as evidenced by a high Pearson correlation coefficient (r = 0.89), justifying the use of density estimates as a reliable proxy for scene complexity. A similar conclusion holds for the relationship between the areas of the predicted bounding boxes and the ground-truth annotations (Pearson r = 0.73), which supports using such area-based statistics as a substitute for the true values.
These findings validate the use of prediction-derived statistics within the proposed active learning framework and support their role as practical proxies for annotation cost and scene difficulty. Since the acquisition criteria are derived from model predictions, they may be noisier in the earliest active learning iterations when the detector is less accurate. However, the observed correlation with ground-truth scene statistics supports their use as practical ranking signals during the acquisition process.
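The agreement check above reduces to computing Pearson and Spearman coefficients between predicted and ground-truth per-image statistics. A dependency-free sketch is shown below; the Spearman variant here ignores tie handling, which is a simplification relative to standard implementations.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction: a simplification)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```

Because Spearman operates on ranks, a monotone but nonlinear relationship between predicted and true object counts still yields a rank correlation of 1, which is why reporting both coefficients gives a fuller picture of proxy quality.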

2.8. Experimental Protocol Overview

The proposed multi-criteria active learning framework was evaluated for olive detection using a YOLOv8-based detector. The initial model was generated by a bootstrap procedure, performed once and reused in all experiments. Iterative retraining with newly selected images from the unlabelled pool is repeated until the pool is depleted. The active learning budget is defined by a fixed number of acquisition iterations with a constant batch size of approximately 150 images per iteration; this fixed batch size ensures that observed performance differences result from the acquisition strategy rather than from variations in annotation volume.
In contrast to our previously published study [17], which used more iterations with smaller or variable increments, the current experimental design employs fewer iterations with larger, fixed increments. The multi-criteria framework enables analysis of learning dynamics and selection behaviour for the specific detection problem at hand, olives in orchard imagery, which was the primary research goal, rather than fine-grained optimization of peak detection performance.
Detection performance was evaluated on a held-out test set using mAP50 and mAP50–95 metrics. The presented results relate to the main reported cost formulation, defined as C = N. Alternative cost proxies were analysed separately to evaluate the robustness of the framework, defined as the stability of detection performance across different annotation cost formulations.
The active learning process therefore stops after the predefined number of acquisition iterations rather than by exhausting a cost budget. Annotation effort is analysed retrospectively using the cumulative number of annotated objects.

3. Results

3.1. Baseline Comparisons of Acquisition Strategies

The aim of the research presented in this paper is to understand how the annotation cost structure affects the active learning pipeline. Therefore, it was necessary to define and implement baseline experiments that serve as a reference for the subsequent analysis of active learning strategies that account for different annotation cost models.
The first reference corresponds to a fully supervised learning environment in which the available labelled data is divided into training, validation, and testing sets in a conventional manner. This configuration provides a non-active-learning reference model and establishes an upper bound on performance for the detection task.
In the second reference line, the active learning procedure uses random sampling, selecting images from the unlabelled pool without any criterion. The selected images are then labelled and added to the training set, and the model is updated with each newly added group of images until the unlabelled pool is exhausted. This strategy serves as a neutral reference that reflects the performance of the active learning pipeline without informed selection.
To provide a stronger comparative context, three additional acquisition strategies were evaluated. The first selects the samples with the highest prediction uncertainty, one of the most commonly used approaches in active learning. The second is a multi-criteria strategy that combines multiple image features (uncertainty, scale, and scene density) using scalar aggregation with equal weights, while the third, Pareto-based selection, performs multi-criteria selection without scalar aggregation.
Figure 3 summarizes the learning dynamics of all four acquisition strategies across active learning iterations. The results are averaged over three random bootstrap seeds, with the mean performance and standard deviation shown for both mAP50 and mAP50–95.
It is worth noting that the random strategy competes with more complex strategies, confirming that a significant portion of the performance improvement in the active learning pipeline is due to the iterative sample-collection and retraining process itself. For this reason, random sampling provides an important reference baseline for analyzing the influence of structured acquisition strategies and annotation-cost modelling.
The uncertainty-based strategy behaves somewhat more stably than random sampling because it prioritizes uncertain predictions. The proposed multi-criteria strategy achieves comparable performance and, like the uncertainty-based strategy, maintains more stable learning dynamics across iterations than random sampling.
Among the evaluated strategies, Pareto-based acquisition consistently achieves the highest performance values for both evaluation metrics. The behaviour of the Pareto strategy suggests that multi-criteria selection, as the basis for active learning, can improve the balance between competing criteria during sample selection.
In addition to learning curves, the normalized Area Under the Learning Curve (AULC) is included in Table 2. AULC is widely used in active learning studies as a measure of learning efficiency, since it summarizes the entire learning trajectory rather than only final performance. Pareto selection achieves the highest AULC for both mAP50 (0.719 ± 0.002) and mAP50–95 (0.400 ± 0.001), outperforming uncertainty-only and multi-criteria scalarization, while random sampling shows the largest variability (e.g., mAP50 AULC 0.689 ± 0.032). The higher AULC indicates that the Pareto strategy improves learning efficiency across the active learning process, reaching higher performance earlier compared with the baseline strategies. In addition, the lower standard deviation across seeds suggests more stable sample selection behaviour.
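The normalized AULC reported in Table 2 can be approximated with a trapezoidal rule over the per-iteration scores; this is a common formulation and may differ in detail from the paper's exact computation:

```python
def normalized_aulc(scores):
    """Normalized Area Under the Learning Curve via the trapezoidal rule.

    scores: per-iteration evaluation values (e.g. mAP50). The iteration axis
    is mapped to [0, 1], so a constant curve at height h yields AULC = h,
    and earlier gains raise the value relative to late ones.
    """
    n = len(scores)
    if n < 2:
        return float(scores[0]) if scores else 0.0
    area = sum((scores[i] + scores[i + 1]) / 2 for i in range(n - 1))
    return area / (n - 1)  # divide by the span to normalize to [0, 1]
```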
This baseline comparison provides the foundation for the subsequent analysis of how different annotation-cost models influence sample selection behaviour. The results obtained on the second dataset are also summarized in Table 2. Using two datasets with different scales and annotation densities provides additional validation of the robustness and generalization of the proposed active learning framework.
Unlike Dataset 1, where each acquisition strategy was evaluated with three random bootstrap seeds to evaluate the stability of the acquisition strategies, the experiments on Dataset 2 (Figure 4) were performed with a single bootstrap seed due to the substantially larger training set and computational constraints. This setting is common in large-scale object detection experiments where training costs are substantially higher. Although variability estimates are not available for Dataset 2, the relative ordering of acquisition strategies remains consistent with Dataset 1, supporting the robustness of the observed trends.
Despite the differences in experimental setup, Table 2 shows consistent trends across the two datasets: the multi-criteria and Pareto acquisition strategies achieve the highest final detection performance and the highest AULC values, indicating improved learning efficiency compared to the baseline strategies.
Interestingly, random sampling achieves relatively good results on Dataset 2, suggesting that the “plain” iterative active learning process itself contributes significantly to performance improvements. However, informed acquisition strategies provide additional gains in both final performance and cumulative learning efficiency.
Overall, results across two datasets with different characteristics show that the proposed multi-criteria active learning framework remains stable and competitive, even when the baseline performance of random sampling is already high.
The two datasets also exhibit different learning dynamics. Dataset 1 represents a smaller-scale setting in which performance differences between acquisition strategies are more pronounced, particularly in learning efficiency (AULC). In contrast, Dataset 2 shows higher absolute performance and smaller performance gaps between strategies. This behaviour suggests that as the available training pool grows, the contribution of the iterative retraining process itself becomes more dominant, while the relative advantage of sophisticated acquisition strategies decreases.

3.2. Effect of Annotation Cost Modelling

To analyse how annotation cost influences the behaviour of the active learning pipeline, baseline experiments were first conducted to establish reference performance under different acquisition strategies. As explained in Section 2.7.2, the annotation effort is approximated using expected cost estimates for each image. These estimates are validated by comparing statistics obtained from model predictions with those derived from ground-truth annotations.
The observed correlations indicate that the predicted number of objects and the bounding box areas provide a reasonable approximation of the scene complexity and the effort required for annotation. Based on this formulation, the following experiments investigate how different annotation cost models affect image ranking and the dynamics of model training during active learning. Random sampling yields consistent, monotonic improvements in detection performance across active learning iterations, indicating that the iterative learning process itself can significantly improve the detector. Therefore, these results serve as a basic reference for analysing how different formulations of annotation costs affect acquisition behaviour and learning dynamics.
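The validation step described above amounts to correlating prediction-derived statistics (e.g. per-image object counts) with their ground-truth counterparts. A plain-Python Pearson correlation sketch is shown below; the paper's exact correlation measure is not specified here, so this should be read as an illustration:

```python
def pearson_r(xs, ys):
    """Pearson correlation between two paired per-image statistics,
    e.g. predicted vs. ground-truth object counts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```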
To provide qualitative insight into how different annotation cost models influence sample selection, Figure 5 shows representative images drawn from the same unlabelled pool at the same active learning stage. The selected examples illustrate typical image characteristics emphasized by each cost formulation. Under the inverse-area model C = N/(A + ε), selected images contain many small objects, which are costly to annotate under this proxy. In contrast, the area-weighted model C = N·A favours images with larger objects, while the density-only model C = N primarily emphasizes scenes with a high number of objects, regardless of object size. For reference, the figure also includes a representative image associated with the zero-cost configuration (ZCW).
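A sketch of the three cost proxies, assuming each image's predicted boxes are available as normalized (width, height) pairs; the function and argument names are illustrative, not the paper's implementation:

```python
def image_cost(boxes, model="N", eps=1e-6):
    """Prediction-derived annotation-cost proxies for one image.

    boxes: list of (w, h) normalized predicted box sizes.
    model: 'N' (count only), 'N*A' (area-weighted) or 'N/A' (inverse-area).
    """
    n = len(boxes)
    if n == 0:
        return 0.0
    mean_area = sum(w * h for w, h in boxes) / n  # A: mean box area
    if model == "N":       # density-only proxy: effort ~ number of objects
        return float(n)
    if model == "N*A":     # area-weighted: larger objects assumed costlier
        return n * mean_area
    if model == "N/A":     # inverse-area: small objects assumed costlier
        return n / (mean_area + eps)
    raise ValueError(model)
```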
The qualitative differences visible in Figure 5 help explain the different selection tendencies induced by the cost models. Cost-aware formulations balance informativeness with estimated annotation effort, whereas cost-free selection tends to favour more extreme candidates, dominated by small or low-confidence detections. To support this visual comparison, Table 3 reports the corresponding per-image values of uncertainty U, scale S, density D, and cost C for each image shown in Figure 5, computed from model predictions.
To gain further insight into the behaviour of the optimization procedure under different cost assumptions, Figure 6 presents results obtained using an equally weighted multi-criteria acquisition strategy for all cost-aware models, together with a random, image-fixed cost model and the ZCW configuration. Performance is evaluated using mAP50 and mAP50–95. The results show that the cost model penalizing dense scenes with small objects (C = N/(A + ε)) yields larger performance gains than random sampling, particularly in later training iterations. In contrast, the model emphasizing average object area (C = N·A) and the density-only model (C = N) provide larger performance improvements in earlier iterations. This behaviour is consistent with general observations in active learning [1], in which images with clear, easily recognizable structures tend to be more informative in the early stages, whereas more difficult samples contribute more strongly in the later stages.
Clearer insight into the effect of cost modelling is obtained when learning curves are re-parameterized by cumulative annotation effort, approximated by the total number of annotated objects (Figure 6, right). Under this effort-based view, the apparent advantage of the ZCW configuration largely disappears. It requires the highest annotation effort to achieve comparable mAP50 levels, indicating lower sample efficiency. In contrast, cost-aware strategies achieve comparable or higher accuracy with substantially fewer annotated objects. This confirms that the superior iteration-based performance of the ZCW configuration observed in Figure 6 (left) is primarily due to higher labelling volume per iteration, rather than more efficient sample selection.
Among the evaluated cost models, the density-based formulation (C = N) achieves strong performance with the lowest annotation effort, reaching it after annotating approximately 5000 objects. Other cost models reach similar final performance levels with higher annotation effort, but still earlier than the cost-free configuration. Overall, all strategies converge within a 5% range of final mAP50 values. This evaluation, based on cumulative annotation effort, better reflects real annotation workflows, where the practical objective is to achieve high detection performance with minimal labelling cost.
To quantify the annotation efficiency illustrated in Figure 6, Table 4 reports the number of annotated objects required to reach a target detection performance (mAP50 = 0.70). Cost-aware acquisition strategies reach the same performance level with fewer annotated objects compared with the cost-free configuration and random sampling. Among the evaluated approaches, the density-based cost model (C = N) provides the most favourable trade-off between annotation effort and detection accuracy.
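The effort-based view and the threshold readout used here amount to re-indexing the learning curve by cumulative object count and reading off the first point that reaches a target score; a small sketch with purely illustrative values:

```python
def effort_curve(objects_per_iter, map_per_iter):
    """Re-parameterize a learning curve by cumulative annotation effort.

    objects_per_iter: number of objects annotated at each iteration.
    map_per_iter: mAP after training on everything annotated so far.
    Returns (cumulative_objects, mAP) pairs, the effort-based view.
    """
    curve, total = [], 0
    for n_objects, score in zip(objects_per_iter, map_per_iter):
        total += n_objects
        curve.append((total, score))
    return curve


def effort_to_reach(curve, target):
    """Smallest cumulative effort at which the target score is first reached."""
    for effort, score in curve:
        if score >= target:
            return effort
    return None  # target never reached within the budget
```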
These results demonstrate that annotation cost modelling improves sample efficiency and influences acquisition behaviour. To better understand how individual selection signals contribute under different cost assumptions, a detailed criterion-level analysis is presented in the following section.

3.3. Criterion Interaction and Weight Scheduling

To complement the zero-weight ablation analysis, additional single-criterion experiments were conducted in which only one acquisition signal was used at a time, while all other criteria, including annotation cost, were disabled. This analysis isolates the individual contribution of uncertainty, scale, and density to the active learning process.
Table 5 and Figure 7 summarize the results of single-criterion ablation experiments. Among the evaluated signals, density-based acquisition provides the strongest individual contribution, outperforming both uncertainty-only and scale-only strategies. Scale-only selection yields the lowest detection performance, indicating that prioritizing small-object relevance alone does not provide sufficiently informative samples. However, the full multi-criteria strategy still achieves the highest performance, confirming that combining complementary selection signals yields more effective active learning behaviour.
In the multi-criteria optimization framework, ablation can be achieved by setting the corresponding weights to zero. The results are presented as differences in mAP50 and mAP50–95 values between iterations during the active learning cycle, as shown in Figure 8. The results reveal typical patterns in the selection of training images. In all experiments, sample density in the image is the dominant criterion and remains important throughout the learning cycle. This is indicated by negative values of the performance-measure differences (delta mAP), since excluding a criterion from the optimization objective degrades performance.
Uncertainty plays an important role during early iterations, when the model benefits most from resolving ambiguous predictions (C = N·A; Figure 8, second row). Object scale shows a stronger influence in intermediate stages or under density-driven cost formulations (Figure 8, third row; Figure 6, blue line). These observations confirm that the effectiveness of each criterion is not fixed but depends on both the learning stage and the cost model. Cost formulations that penalize dense scenes alter the relative contribution of density and scale, demonstrating a strong interaction between informativeness signals and annotation cost. Finally, to verify the structured patterns observed in the ablation analysis, the weight coefficients associated with the acquisition criteria were adjusted across active learning iterations, resulting in a scheduled-weight (SCHW) multi-criteria acquisition strategy. The schedule was derived from the qualitative trends observed in the criterion-level analysis rather than optimized using test-set performance. The weights were adjusted according to Table 6, which encodes the most favourable sample-selection strategy indicated by the ablation study.
For example, under the cost C = N, where the model prefers dense samples, density itself is correlated with cost, meaning that the denominator penalizes crowded scenes unless the numerator is sufficiently large. Figure 8 also suggests that U and S contribute to efficiency mainly in the early stages of learning, which can be captured with decaying coefficients, while the growing influence of density across batches calls for increasing coefficient values. The results of the described setup are presented in Figure 9 (orange line), which shows that scheduled coefficients tend to improve learning efficiency, particularly in later iterations.
The behaviour of the acquisition function is further shaped by its ratio-based structure (5). Because informativeness terms appear in the numerator and annotation cost in the denominator, the effective influence of each criterion depends not only on its relative weight but also on its absolute magnitude relative to the cost term. When the denominator dominates, score differences among candidate images are compressed, reducing discrimination between highly and moderately informative samples. This effect was observed under the initial weight configuration (Table 6), in which small numerator weights, combined with a strong cost term, limited selection contrast in early iterations.
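The ratio structure described above can be sketched as follows; the exact form of Equation (5) may differ, so the symbols, weights, and defaults here are illustrative assumptions rather than the paper's implementation:

```python
def cost_aware_score(u, s, d, c, w=(1.0, 1.0, 1.0), wc=1.0, eps=1e-6):
    """Ratio-form acquisition: weighted informativeness terms in the
    numerator, weighted annotation cost (plus eps) in the denominator."""
    wu, ws, wd = w
    return (wu * u + ws * s + wd * d) / (wc * c + eps)
```

When the cost term dominates the denominator (large wc), absolute score differences between highly and moderately informative candidates shrink, which illustrates the compressed selection contrast discussed above.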
The revised scheme (Table 7) increases the relative emphasis on density and gradually strengthens the influence of cost as learning progresses, while maintaining a higher magnitude of informativeness in early iterations (Figure 9—green line).
Early iterations prioritize the exploration of uncertain and dense scenes, enabling rapid performance improvements. As learning progresses, the increasing emphasis on density and cost shifts the selection strategy toward maximizing marginal gains under a constrained annotation budget. In contrast to the initial configuration, the revised scheme produces clearer score separation among candidates and aligns the mathematical behaviour of the acquisition function with the empirical importance of the criteria.
In summary, the improvement does not stem from altering the criteria’s conceptual priorities, but from adapting their numerical realization to the non-linear structure of the cost-aware acquisition function. This adjustment is essential for translating ablation-derived insights into effective active learning behaviour.

3.4. Optimization Strategy and Selection Behaviour

Overall, ablation analysis demonstrates that no single criterion consistently dominates across different annotation cost formulations. This sensitivity suggests that optimization procedures based on a scalarized approach with predefined trade-off weights contributing to the goal function may be brittle when the annotation cost is unknown. To address the observed sensitivity to the trade-off weighting scheme and cost modeling, Pareto optimization was employed. In contrast to scalarized approaches, Pareto optimization does not require a weighting scheme and maintains a set of non-dominated candidates that balance uncertainty, scale, density, and cost. This formulation naturally adjusts for variability in the importance of criteria, reducing reliance on sensitive and precise cost calibration.
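A minimal sketch of the non-dominated filtering behind Pareto selection; objectives are oriented so that larger is better (annotation cost would enter as its negative), and the quadratic scan is illustrative rather than the paper's implementation:

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly
    better in at least one (larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return indices of non-dominated candidates (naive O(n^2) scan)."""
    front = []
    for i, a in enumerate(candidates):
        if not any(dominates(b, a) for j, b in enumerate(candidates) if j != i):
            front.append(i)
    return front
```

Because only dominance relations matter, no trade-off weights are needed, which is what reduces sensitivity to cost calibration.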
Figure 10 shows the results for Pareto optimization when all criteria were included (equivalent to EW experiments under the scalarized goal function). The experiments demonstrate that Pareto-based selection achieves comparable detection performance while exhibiting lower sensitivity to the specific annotation cost than scalarized strategies. In comparison with the scalarized approach, in which all mAP50 curves converge within a 5% range (Figure 6), the Pareto mAP50 curves converge within a 1% range (Figure 10).
Therefore, Pareto optimization offers a cost-robust alternative for active learning in dense object detection, where annotation costs cannot be determined a priori.
For real-world annotation workflows, Pareto-based selection offers a safe default strategy: it delivers competitive detection performance without requiring precise cost calibration, while allowing practitioners to identify and discard cost formulations that do not induce consistent or interpretable selection patterns. However, to justify the weighting logic of individual criteria, a similar procedure was applied to Pareto optimization. Because Pareto optimization considers all criteria simultaneously, the weights applied in the scalar form do not affect it. With Pareto, a minor influence of a criterion can instead be simulated by excluding that criterion from the optimization procedure. Therefore, for the cost model C = N, the form of Pareto optimization was adjusted in batches, as shown in Table 8. The cost model encodes density but not olive size, so excluding density could be problematic. Including S (scale) in later iterations is desirable, while excluding uncertainty in early iterations would reduce the search for differences among samples. To avoid issues in Pareto optimization arising from replacing criteria (for example, in batch 4, uncertainty could be replaced by the scale criterion), a transitional form was chosen for batch 4 that includes all optimization criteria.
Figure 11 shows the results of Pareto optimization with all criteria included (blue line), as well as optimization with certain criteria excluded owing to their demonstrated negative contribution during the optimization process (orange line). The figure demonstrates that the criteria retained during the optimization procedure provide greater efficiency across the active learning cycle, with performance comparable to the tuned scalarized optimization form, as indicated in Figure 12.
Beyond performance metrics, acquisition behaviour was analysed using the Jaccard overlap between image sets selected by different strategies (Figure 13). Under the volume-based cost formulation, C = N·A (Figure 13, right), overlap decreased gradually across iterations and stabilized at moderate values, indicating partial divergence in selected samples. In contrast, a strongly conflicting cost formulation, such as inverse-area penalization, C = N/(A + ε) (Figure 13, left), produced near-zero overlap between scalarized and Pareto selection, resulting in substantial divergence in acquisition behaviour.
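The overlap measure is the standard Jaccard index over the selected image identifiers; a minimal sketch:

```python
def jaccard_overlap(selected_a, selected_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two selected image sets."""
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)
```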
Despite these differences in selected subsets, detection performance consistently converged across strategies. This combination of low selection overlap and comparable accuracy indicates that explicit cost modelling regularizes acquisition behaviour without enforcing a unique optimal subset. Instead, multiple acquisition paths can yield similar learning outcomes, provided that trade-offs between informativeness and annotation effort are appropriately constrained.
Qualitative inspection of selected samples supports these findings. Cost-free selection tends to prioritize scenes dominated by numerous small and low-confidence detections, whereas cost-aware strategies balance dense supervision against annotation burden. These differences reflect the criterion-level trends observed in the ablation study and help explain the absence of large performance gaps between strategies.
These observations reinforce the conclusion that explicit cost modelling moderates extreme selection patterns while maintaining comparable detection performance across strategies.
Overall, the Jaccard analysis demonstrates that the proposed multi-criteria framework meaningfully influences acquisition behaviour without being overly sensitive to weighting or optimization strategy. Cost formulation plays a dominant role in shaping selection diversity, while detection performance remains robust across strategies. In this way, it is shown that cost-aware active learning based on a multi-criteria approach does not yield a unique optimal subset of images, but offers an interpretable mechanism that balances the informativeness of images and the effort invested in annotation.

3.5. Qualitative Comparison of Selected Samples

To provide qualitative insight into how different acquisition strategies influence sample selection, Figure 14 compares representative images selected during the BATCH3→BATCH4 active learning iteration.
The figure shows samples obtained using three multi-criteria acquisition strategies: MC-EW (equal weighting of the selection criteria), MC-SCHW (scheduled weighting where the relative importance of criteria changes across iterations), and MC-EW-ZCW (equal weighting with the cost term disabled). Each strategy selects images independently from the same unlabeled pool at the same learning stage.
YOLO predictions are overlaid on each image to visualize the density, scale, and spatial distribution of detected olives. These qualitative examples correspond to measurable differences in the underlying acquisition signals, including prediction uncertainty U(x), small-object relevance S(x), scene density D(x), and estimated annotation cost C(x). The comparison illustrates that cost-aware strategies (MC-EW and MC-SCHW) tend to select scenes that balance object density and annotation workload, whereas the cost-free configuration (MC-EW-ZCW) more frequently selects scenes dominated by numerous small and low-confidence detections. Despite these differences in visual characteristics, the resulting training samples provide comparable supervision for the detector, consistent with the convergence behaviour observed in the quantitative experiments.

3.6. Relation to Previously Published Combined Strategies

Compared to previously proposed heuristic combinations of uncertainty, scale, and density [17], the present framework achieves comparable final detection performance while offering improved interpretability and extensibility. Rather than optimizing a dataset-specific heuristic, the proposed formulation explicitly models annotation cost and treats acquisition as a structured multi-objective optimization problem.
The decreasing Jaccard overlap across iterations demonstrates that the framework meaningfully shapes acquisition behaviour rather than collapsing into a single dominant signal. At the same time, the convergence of detection performance across scalarized and Pareto strategies indicates that the underlying difficulty structure of dense orchard imagery can be exploited through multiple, cost-regularized acquisition paths. Direct comparisons with many recent active learning frameworks are difficult because most approaches target classification tasks or rely on different detector architectures and dataset protocols.

3.7. Summary of Findings

The experimental results lead to the following findings:
  • The proposed multi-criteria active learning framework improves detection performance over the initial bootstrap model and consistently outperforms random sampling across all active learning iterations and cost formulations.
  • Cost-aware strategies reach similar or higher accuracy using fewer annotated objects. This shows that explicit cost modelling improves annotation efficiency rather than simply changing the shape of iteration-based learning curves.
  • The results across two datasets with different annotation densities indicate that the proposed framework is robust to dataset scale and maintains consistent acquisition behaviour.
  • Ablation analysis demonstrates that scene density is the most influential selection criterion across all cost models. The contribution of uncertainty and object scale depends on both the learning stage and the assumed annotation cost. Uncertainty is most pronounced in early iterations, whereas object scale becomes more relevant in intermediate stages or under density-based cost formulations.
  • Time-scheduled weighting strategies guided by ablation results improve learning efficiency compared to fixed equal weighting. These improvements arise from better balancing the relative influence of the selection criteria over time, rather than from optimizing for a single dominant criterion.
  • Analysis of selection behaviour using Jaccard overlap shows that different acquisition strategies often select increasingly different image sets as learning progresses. The timing and degree of this divergence depend on the annotation cost model. Despite these differences in selected samples, detection performance converges across strategies.
  • Explicit modelling of annotation cost plays a key role in shaping selection behaviour by reducing extreme or overly aggressive sample selection. In contrast, the choice between scalarized and Pareto-based optimization has a smaller effect on final detection accuracy.
Overall, these findings indicate that the main value of the proposed framework lies in its clear formulation, interpretability, and robustness across different design choices, and in providing a principled, cost-aware framework for multi-criteria active learning under realistic annotation constraints.

4. Conclusions and Future Work

This work proposed a multi-criteria, cost-aware active learning framework for small-object detection in agricultural imagery, explicitly incorporating prediction uncertainty, object scale, scene density, and annotation cost into the sample selection process. The proposed formulation generalizes and unifies several previously reported single-criterion and heuristic acquisition strategies within a coherent and interpretable optimization framework.
Experimental results demonstrate that the proposed approach consistently improves detection performance over random sampling and achieves accuracy comparable to previously published combined strategies. While iteration-based learning curves may suggest advantages for cost-free selection, re-parameterizing performance with respect to cumulative annotation effort reveals that cost-aware strategies achieve similar or higher accuracy with substantially fewer annotated objects. This finding highlights the importance of evaluating active learning methods with respect to annotation effort rather than iteration count alone.
Criterion-level ablation analysis shows that scene density is the most consistently influential factor across cost formulations, while the contributions of uncertainty and object scale depend on both the learning stage and the assumed annotation cost structure. Uncertainty primarily contributes during early iterations, whereas scale becomes more relevant in intermediate stages or under density-driven cost models. These results demonstrate that the importance of individual criteria is not fixed but modulated by cost assumptions and learning dynamics.
Analysis of selection behaviour using Jaccard overlap reveals that acquisition strategies can differ substantially in the image subsets they select, particularly under strongly constraining cost formulations. Importantly, such divergence does not imply degraded detection performance. Instead, the results indicate that explicit cost modelling regularizes acquisition behaviour by moderating extreme selection patterns and enabling multiple acquisition paths to yield comparable learning outcomes.
The comparison between scalarized and Pareto-based optimization further shows that both approaches can yield competitive performance within the proposed framework. While scalarized strategies allow explicit control through weighting and scheduling, Pareto-based selection offers a robust alternative that reduces sensitivity to precise weight specification and cost calibration. In dense small-object detection scenarios, where cost structure may be uncertain or difficult to estimate, Pareto optimization represents a safe and practical default.
Several directions for future work naturally arise from this study. First, Pareto-based selection could be further investigated in settings with weaker correlations between selection criteria to better understand trade-offs between informativeness and annotation cost. Second, curriculum-driven and phenology-aware scheduling strategies offer promising opportunities to adapt acquisition behaviour to both the learning stage and biological development. Finally, incorporating empirical measurements of annotation effort, such as time-based statistics collected from annotation tools, would enable direct validation of cost proxies and support more realistic cost-aware optimization.
Overall, these results suggest that simple, interpretable, and well-regularized multi-criteria acquisition functions provide a favourable balance between performance, robustness, and practical applicability for active learning in agricultural object detection under realistic annotation constraints. The results suggest that cost-aware multi-criteria selection can improve annotation efficiency without requiring complex acquisition heuristics.

Author Contributions

Conceptualization, M.B. and O.U.; methodology, M.B.; software, M.B. and O.U.; validation, M.B., O.U. and J.M.; formal analysis, M.B. and V.P.; investigation, M.B. and O.U.; resources, J.M. and V.P.; data curation, J.M. and V.P.; writing—original draft preparation, M.B. and O.U.; writing—review and editing, M.B., O.U., J.M. and V.P.; visualization, M.B., O.U. and J.M.; supervision, V.P.; project administration, V.P.; funding acquisition, V.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Croatian Science Foundation under the project number HRZZ IP-2024-05-6393.

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request. A dedicated project webpage providing access to the datasets used in this study is currently under development and will be made publicly available in the near future.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin–Madison: Madison, WI, USA, 2009.
  2. Gal, Y.; Islam, R.; Ghahramani, Z. Deep Bayesian Active Learning with Image Data. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017.
  3. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.; Li, Z.; Gupta, B.; Chen, X.; Wang, X. A Survey of Deep Active Learning. ACM Comput. Surv. 2021, 54, 1–40.
  4. Sareer Ul, A.; Adnan, H.; Bumsoo, K.; Sanghyun, S. Deep learning based active learning technique for data annotation and improve the overall performance of classification models. Expert Syst. Appl. 2023, 228, 120391.
  5. Griffioen, N.; Rankovic, N.; Zamberlan, F.; Punith, M. Efficient annotation reduction with active learning for computer vision-based Retail Product Recognition. J. Comput. Soc. Sci. 2024, 7, 1039–1070.
  6. Brust, C.-A.; Käding, C.; Denzler, J.; Groenen, M. Active Learning for Deep Object Detection. In Proceedings of the ICCV Workshops, Seoul, South Korea, 27 October–2 November 2019.
  7. Garcia, D.; Carias, J.; Adão, T.; Jesus, R.; Cunha, A.; Magalhães, L.G. Ten Years of Active Learning Techniques and Object Detection: A Systematic Review. Appl. Sci. 2023, 13, 10667.
  8. Sener, O.; Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
  9. Ash, J.T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; Agarwal, A. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds (BADGE). In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 26 April–1 May 2020.
  10. Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  11. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974.
  12. Käding, C.; Freytag, A.; Rodner, E.; Bodesheim, P.; Denzler, J. Active Learning for Autonomous Vision in Agriculture. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, South Korea, 9–14 October 2016.
  13. Yang, Y.; Wu, Z.; Zheng, H.; Yang, S. Active Learning for Crop Disease Recognition Using Deep Learning. Comput. Electron. Agric. 2020, 173, 105441.
  14. Xue, S.; Li, Z.; Wang, D.; Zhu, T.; Zhang, B.; Ni, C. YOLO-ALDS: An instance segmentation framework for tomato defect segmentation and grading based on active learning and improved YOLO11. Comput. Electron. Agric. 2025, 238, 110820. [Google Scholar] [CrossRef]
  15. Haertel, R.A.; Seppi, K.D.; Ringger, E.K.; Carroll, J.L. Return on Investment for Active Learning. In Proceedings of the NIPS Workshop on Cost-Sensitive Learning, Whistler, BC, Canada, 12 December 2008. [Google Scholar]
  16. Kapoor, A.; Horvitz, E.; Basu, S. Selective supervision: Guiding supervised learning with decision-theoretic active learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 877–882. [Google Scholar]
  17. Bonković, M.; Uvodić, O.; Cecić, M.; Kuzmanić, A. Optimizing Olive Detection via YOLOv8 and Active Learning: Benefits of Uncertainty-Based and Missed-Detection Sampling Strategies. In Proceedings of the 2025 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 18–20 September 2025. [Google Scholar]
  18. Feng, Y.; He, J.; Wang, L.; Yang, W.; Deng, S.; Li, L. A Multi-Strategy Active Learning Framework for Enhanced Peripheral Blood Cell Image Detection. IEEE Access 2025, 13, 104815–104827. [Google Scholar] [CrossRef]
  19. Liu, R.; Ma, G.; Kong, F.; Ai, Z.; Xiong, K.; Zhou, W.; Wang, X.; Chang, X. Pareto-guided active learning for accelerating surrogate-assisted multi-objective optimization of arch dam shape. Eng. Struct. 2024, 326, 119541. [Google Scholar] [CrossRef]
  20. Musić, J.; Bonković, M.; Sikora, T.; Papić, V. Evaluation of deep neural network architectures and image datasets for olive fruit detection. In Proceedings of the 2025 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 18–20 September 2025. [Google Scholar]
  21. Osco-Mamani, E.; Santana-Carbajal, O.; Chaparro-Cruz, I.; Ochoa-Donoso, D.; Alcazar-Alay, S. The Detection and Counting of Olive Tree Fruits Using Deep Learning Models in Tacna, Perú. AI 2025, 6, 25. [Google Scholar] [CrossRef]
  22. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
  23. Ultralytics. YOLOv8: State-of-the-Art Object Detection. Available online: https://docs.ultralytics.com (accessed on 6 March 2026).
Figure 1. Overview of the proposed iterative active learning pipeline. All data subsets originate from a single fully annotated dataset and are partitioned once into fixed training, validation, pool, and test sets. An initial labelled batch is obtained via bootstrap selection. The pipeline consists of four stages: dataset preparation, active learning iterations, model training, and final evaluation.
Figure 2. Predicted vs. ground-truth object counts for image tiles in the labelled batch. The dashed line indicates the ideal 1:1 relationship.
Figure 3. Active learning performance across iterations averaged over three bootstrap seeds on Dataset 1. Mean and standard deviation are shown for four acquisition strategies: Random sampling, Uncertainty sampling, Multi-criteria selection, and Pareto-based selection. (Left): mAP50. (Right): mAP50–95. Shaded regions around each curve represent the standard deviation across multiple runs with different random seeds.
Figure 4. Active learning performance across iterations on Dataset 2. Results are shown for four acquisition strategies: Random sampling, Uncertainty sampling, Multi-criteria selection, and Pareto-based selection. The dashed line indicates the performance of the fully supervised detector trained on the entire labelled training set. (Left): mAP50. (Right): mAP50–95.
Figure 5. Qualitative comparison of representative image tiles associated with different annotation cost models, drawn from the same unlabelled pool at the same active learning stage.
Figure 6. mAP50 performance across active learning iterations for equally weighted multi-criteria selection (Left) and detection performance as a function of cumulative annotation effort comparing cost-aware, cost-free, and random cost models (Right).
Figure 7. Final test mAP50 for single-criterion and multi-criteria acquisition strategies on Dataset 1. Density-based selection provides the strongest individual contribution, while the multi-criteria strategy achieves the highest overall performance.
Figure 8. Ablation study showing the impact of excluding individual criteria on mAP50 (Left) and mAP50–95 (Right) across active learning iterations under different cost models: C = N/(A + ε) (First row); C = NA (Second row); C = N (Third row).
Figure 9. mAP50 (Left) and mAP50–95 (Right) performance across active learning iterations comparing equally weighted and scheduled-weight multi-criteria selection under cost C = N.
Figure 10. Learning curves for Pareto selection comparing cost-aware, cost-free, and random cost models. (Left): mAP50; (Right): mAP50–95.
Figure 11. Effect of scheduled weighting on Pareto selection performance (mAP50) under cost C = N.
Figure 12. Comparison of Pareto-based and scalarized multi-criteria (MC) selection using scheduled weights under cost C = N, evaluated by (Left): mAP50 and (Right): mAP50–95 across active learning iterations.
Figure 13. Jaccard overlaps between image sets selected by different acquisition strategies across active learning iterations. Results are shown for two annotation cost formulations. (Left): inverse-area cost C = N/(A + ε); (Right): volume-based cost C = N·A, where overlap decreases gradually and stabilizes at moderate values, indicating partial divergence in selected samples.
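The Jaccard overlap reported in Figure 13 measures how strongly two acquisition strategies agree on which pool images to label at a given iteration. A minimal sketch (the convention for two empty selections is an assumption):

```python
def jaccard(selected_a, selected_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of selected
    image identifiers; 1.0 means identical selections, 0.0 disjoint."""
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)
```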
Figure 14. Qualitative comparison of image tiles selected during the BATCH3→BATCH4 active learning iteration using three multi-criteria (MC) acquisition strategies. The rows correspond to MC-EW (multi-criteria selection with equal weighting of uncertainty U, scale S, density D, and cost C), MC-SCHW (multi-criteria selection with scheduled weights adapted across iterations), and MC-EW-ZCW (equal-weight multi-criteria selection with zero-cost weighting, where the cost term is excluded). Columns represent representative samples selected independently by each strategy from the same unlabelled pool at the same iteration stage.
Table 1. Pearson and Spearman correlations between prediction-derived and ground-truth scene statistics.

Statistic | Pearson r | Spearman ρ
Object count | 0.892 | 0.782
Mean object area | 0.737 | 0.840
Table 2. Baseline comparison of acquisition strategies for both datasets. Mean and standard deviation are computed over three random seeds for Dataset 1. AULC denotes the normalized area under the learning curve. The highest value for each metric is highlighted.

Method | Dataset 1 Final mAP50 | Dataset 1 AULC | Dataset 2 Final mAP50 | Dataset 2 AULC
Random | 0.768 ± 0.009 | 0.689 ± 0.032 | 0.847 | 0.859
Uncertainty | 0.762 ± 0.007 | 0.704 ± 0.009 | 0.813 | 0.854
Multi-criteria | 0.761 ± 0.013 | 0.705 ± 0.009 | 0.922 | 0.902
Pareto | 0.770 ± 0.003 | 0.719 ± 0.002 | 0.923 | 0.901
Table 3. Per-image uncertainty U, scale S, density D, and cost C values for the images shown in Figure 5.

Cost model | N/(A + ε) | N·A | N | ZCW
N | 42 | 49 | 45 | 41
U | 0.814 | 0.816 | 0.768 | 0.867
S | 0.959 | 0.840 | 0.892 | 0.852
D | 3.761 | 3.912 | 3.829 | 3.738
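The cost models compared above can be sketched as simple functions of the prediction-derived scene statistics, where N is the predicted object count and A the mean predicted object area. This is an illustrative reconstruction under stated assumptions (the value of ε and the string labels are not taken from the paper):

```python
def annotation_cost(n_objects, mean_area, model="N", eps=1e-6):
    """Prediction-derived annotation cost proxies for one image tile."""
    if model == "N":            # cost grows with object count
        return float(n_objects)
    if model == "N/(A+eps)":    # many small objects are the most expensive
        return n_objects / (mean_area + eps)
    if model == "N*A":          # cost scales with total annotated area
        return n_objects * mean_area
    if model == "ZCW":          # zero-cost weighting: cost term disabled
        return 0.0
    raise ValueError(f"unknown cost model: {model}")
```

Under "N/(A+eps)", tiles crowded with small objects receive the highest estimated cost, which matches the inverse-area formulation C = N/(A + ε) used in the figures.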
Table 4. Annotation efficiency comparison across acquisition strategies. Values are derived from the cumulative annotation effort curves shown in Figure 6. The best value for each metric is highlighted.

Method | Images to Reach mAP50 = 0.70 | Objects Annotated
Random | 420 | 5400
Uncertainty | 390 | 5100
Multi-criteria | 350 | 4600
Pareto | 340 | 4500
Table 5. Single-criterion ablation on Dataset 1. Each strategy uses only one acquisition signal while all other criteria are disabled. Results correspond to the final active learning iteration (BATCH5).

Strategy | Active Criterion | Final mAP50 | Final mAP50–95 | ΔmAP vs. MC
Uncertainty-only | U | 0.713 | 0.424 | −0.048
Scale-only | S | 0.697 | 0.404 | −0.064
Density-only | D | 0.720 | 0.445 | −0.041
Multi-criteria | U + S + D | 0.761 ± 0.013 | ~0.430 | —
Table 6. Weight coefficients for the optimization parameters.

BATCH | U | S | D | C
BATCH2 | 0.35 | 0.15 | 0.50 | 1.0
BATCH3 | 0.30 | 0.15 | 0.55 | 1.0
BATCH4 | 0.20 | 0.10 | 0.70 | 1.0
BATCH5 | 0.15 | 0.10 | 0.75 | 1.0
Table 7. Tuned weight coefficients for the optimization parameters.

BATCH | U | S | D | C
BATCH2 | 1.2 | 0.6 | 1.8 | 0.6
BATCH3 | 1.0 | 0.5 | 2.2 | 0.7
BATCH4 | 0.8 | 0.4 | 2.8 | 0.8
BATCH5 | 0.6 | 0.3 | 3.2 | 0.9
Table 8. Tuned criteria inclusion for the Pareto optimization across batches (1 = criterion included, 0 = excluded).

BATCH | U | S | D | C
BATCH2 | 1 | 0 | 1 | 1
BATCH3 | 1 | 0 | 1 | 1
BATCH4 | 1 | 1 | 1 | 1
BATCH5 | 0 | 1 | 1 | 1
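Pareto-based selection keeps the pool images that are non-dominated on the active criteria (maximizing U, S, D while minimizing C). The sketch below illustrates the dominance test; the paper's exact front-ranking and tie-breaking (e.g., NSGA-II-style sorting [22]) is not reproduced here.

```python
def is_dominated(p, q):
    """True if candidate p = (U, S, D, C) is dominated by q:
    q is at least as good on every criterion and strictly better on one."""
    better_eq = q[0] >= p[0] and q[1] >= p[1] and q[2] >= p[2] and q[3] <= p[3]
    strictly = q[0] > p[0] or q[1] > p[1] or q[2] > p[2] or q[3] < p[3]
    return better_eq and strictly

def pareto_front(candidates):
    """Return indices of the non-dominated candidates (the first front)."""
    return [i for i, p in enumerate(candidates)
            if not any(is_dominated(p, q)
                       for j, q in enumerate(candidates) if j != i)]
```

Zeroing out a criterion, as in Table 8, amounts to dropping that coordinate from the dominance comparison before computing the front.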
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
