1. Introduction
Palynology, the scientific study of pollen grains and other micropaleontological entities, is of critical importance across diverse scientific domains, including paleoclimate reconstruction [1], biodiversity monitoring [2,3], and allergenic disease research [4]. Crucially, it is also a vital tool for modern agricultural science. In this domain, the precise identification and quantification of pollen are fundamental to crop phenotyping, optimizing pollination ecology for yield enhancement, and monitoring the spread of pollen-borne pathogens. The type and concentration of airborne pollen are key indicators for assessing environmental quality, forecasting allergy seasons, and monitoring agricultural productivity [5,6]. However, conventional pollen analysis relies heavily on manual microscopic enumeration by expert palynologists. This process is not only time-consuming and labor-intensive but is also prone to subjective error, severely limiting the throughput and timeliness of data acquisition [7]. In typical monitoring setups, a trained palynologist can only process on the order of tens of slides per day, which is orders of magnitude below what is required for high-temporal-resolution phenological monitoring and large-scale biodiversity assessments. This limitation forms a critical data bottleneck, hindering our ability to monitor rapid ecological processes such as phenological shifts in response to climate change, the spread of invasive species, or real-time changes in plant biodiversity. Consequently, the development of automated, high-throughput, and high-precision pollen identification technologies has become a pressing demand in the field.
The advent of computer vision and deep learning has presented new opportunities for automated pollen identification [8]. Object detection, a fundamental yet critical task in computer vision, aims to identify and localize instances of specific classes within an image [9]. When applied to the analysis of microscopic images, particularly for pollen grain recognition, this task presents unique challenges. These challenges stem from the minute size of pollen grains, their embedding within complex backgrounds composed of dust and debris, their morphological diversity and tendency to aggregate, and their three-dimensional nature introduced by multi-focal plane imaging, which results in disparate appearances and sharpness for the same particle across different focal planes [7,10]. Early automated approaches primarily depended on traditional image processing techniques, such as color similarity transforms [11], active contour models [12], or feature engineering based on shape and texture [13]. Although these methods achieved some success on specific datasets, they often required tedious manual feature design, and their robustness and generalization capabilities were often suboptimal when faced with real-world samples characterized by complex backgrounds, pollen aggregation, or morphological variability [7,10]. Deep learning, particularly the emergence of Convolutional Neural Networks (CNNs), offered a powerful new paradigm to address these issues [8,9]. By automatically learning hierarchical features from data, CNN-based object detection models have achieved revolutionary breakthroughs. In recent years, one-stage detectors, striking an exceptional balance between speed and accuracy, have rapidly become the mainstream approach for real-time object detection [14]. These detectors frame the task as an end-to-end regression problem, thereby enabling real-time performance. In their architectural evolution, multi-scale feature representation is crucial for detecting objects of varying sizes. The Feature Pyramid Network (FPN), which constructs a feature pyramid via top-down pathways and lateral connections to effectively merge high-level semantic information with low-level spatial details, has become an indispensable component of modern detectors [15]. Subsequent architectures combine these pyramid features with post-processing components like Non-Maximum Suppression (NMS) or Transformer-based decoders [16,17] for final detection.
The field of pollen identification and classification has also seen widespread application of these advances, as demonstrated by Kubera et al. [18], who applied the YOLOv5 model to detect three pollen types with high morphological similarity, showing performance superior to other detectors such as Faster R-CNN and RetinaNet. Tan et al. [19] also developed the PollenDetect system based on YOLOv5 for identifying pollen activity. These studies demonstrate the significant potential of deep learning models, especially the YOLO series, in automated palynology. However, the direct application of these general-purpose models to pollen identification is still confronted with distinct challenges. The minute size of pollen grains means they occupy a limited number of effective pixels in microscopic images [7]. Consequently, relying solely on high-level semantic information is insufficient for high-precision recognition, rendering the effective extraction and utilization of boundary information paramount. Although some works have explored edge detection as an auxiliary task [20], how to systematically and synergistically fuse these low-level edge cues with high-level semantic features within the network remains a critical and unresolved issue.
To this end, we design and propose HieraEdgeNet, a multi-scale, edge-enhanced framework specifically engineered for high-precision automated pollen identification. Our architecture builds upon the foundational concepts previously discussed but introduces deep customizations and innovations to address the specific challenges of pollen analysis. The core innovation of HieraEdgeNet lies in three synergistic modules. First, we designed the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale edge feature pyramid in the early stages of the network, corresponding to the semantic hierarchy of the backbone. Subsequently, we introduce the Synergistic Edge Fusion (SEF) module, responsible for the deep, synergistic fusion of these scale-aware edge cues with semantic information from the backbone at each respective scale. Finally, to achieve ultimate feature refinement at the most detail-rich feature level, we adapt and incorporate the Cross Stage Partial Omni-Kernel Module (CSPOKM) [21]. Through these synergistic modules, HieraEdgeNet significantly enhances the detection and localization accuracy of minute, indistinct targets. This technological advancement offers a robust solution for high-throughput palynological analysis, thereby unlocking its full potential for applications in ecology and precision agriculture. Specifically, it provides the technical foundation for generating the large-scale, high-resolution datasets essential for modern biodiversity monitoring, tracking ecosystem responses to climate change, and paleoecological reconstruction, while also supporting precision agriculture through pollination assessment and crop health monitoring, thus contributing to both ecological sustainability and food security. In summary, this work makes three main contributions. First, we propose HieraEdgeNet, a palynology-specific detector that explicitly leverages a hierarchical edge feature pyramid and edge–semantic fusion to address the indistinct boundaries and minute size of microscopic pollen grains. Second, we integrate a Cross Stage Partial Omni-Kernel module that concentrates large-kernel computation at the most detail-rich scale, yielding a favorable accuracy–efficiency trade-off compared with YOLO-based baselines while keeping the overall computational cost moderate. Third, we construct a unified large-scale pollen detection benchmark from multiple public sources and synthetic compositions, and we systematically benchmark HieraEdgeNet against both YOLO and detection Transformer models, thereby providing a rigorous and reproducible evaluation of its performance.
2. Materials and Methods
Because this work targets an interdisciplinary audience, including palynologists, ecologists, and agricultural scientists who may not be familiar with modern object-detection architectures, this section briefly reviews the key modules and datasets on which our design builds before presenting the proposed HieraEdgeNet architecture.
2.1. Efficient Network Modules Based on Cross Stage Partial Ideology
To control computational costs while increasing network depth, efficient network architecture design is paramount. CSPNet (Cross Stage Partial Network) [22] introduced an effective strategy that partitions the feature map entering a processing block along the channel dimension into two parts. One part undergoes deep transformation through the block, while the other is concatenated with the processed features via a shortcut connection. This design reduces computational redundancy and enhances gradient propagation throughout the network, thereby improving model learning capacity and efficiency. In modern object detection architectures, the basic convolution operation is encapsulated within a standardized Conv block. This block conventionally combines a 2D convolution layer, a batch normalization layer, and an activation function in series. This Conv-BN-Act structure has been proven to effectively accelerate model convergence and improve training stability, while also providing a regularization effect [23]. The C3k2 module, a core component for deep feature extraction in our architecture, is based on the design principles of CSPNet and adheres to the C2f paradigm [24]. This module is typically configured with standard Bottleneck blocks, each containing two convolution layers. The output of each processing unit is concatenated with the output of the previous stage, achieving a dense feature aggregation akin to that in DenseNet [25]. Finally, the features from all branches are concatenated and passed through a 1 × 1 convolution for final feature fusion. This design not only minimizes computational redundancy but also facilitates network learning through enriched gradient paths. The Area-Attention C2f (A2C2f) module embeds a region-based attention mechanism (AAttn) into the efficient C2f structure, endowing the model with dynamic, context-aware capabilities while retaining the advantages of CNN's local inductive bias [14]. The core of the A2C2f module is an ABlock, which comprises an AAttn module and a feed-forward network (MLP) with residual connections, structurally similar to a standard Encoder layer in a Vision Transformer (ViT) [26]. The AAttn module implements multi-head self-attention but can constrain the computation to specific regions of the feature map via an area parameter, enabling a trade-off between global attention and computational cost.
To enable the network to capture multi-scale contextual information from a single-scale feature map, the Spatial Pyramid Pooling - Fast (SPPF) module was introduced as an efficient variant [27,28]. It employs a series of max-pooling layers to equivalently simulate the effect of parallel large-kernel pooling. An input feature map first undergoes channel reduction via a 1 × 1 convolution. Subsequently, this reduced feature map is passed serially through a max-pooling layer with a fixed kernel size (e.g., k = 5) and a stride of 1, three consecutive times. The initial channel-reduced feature is then concatenated along the channel dimension with the outputs from the three serial pooling operations. Since two consecutive k × k pooling operations (with stride 1) have a receptive field approximately equivalent to a single (2k − 1) × (2k − 1) kernel, and three operations to a (3k − 2) × (3k − 2) kernel, SPPF with k = 5 serially applied three times can effectively capture receptive fields similar to those from parallel 5 × 5, 9 × 9, and 13 × 13 kernels in traditional SPP. Finally, the concatenated features are fused and their channels are adjusted by another 1 × 1 convolution. This serial design maintains multi-scale context awareness while significantly reducing computational complexity and improving inference speed.
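To make this construction concrete, the sketch below implements a minimal SPPF block in PyTorch consistent with the description above; class and argument names are illustrative and not taken from any released code.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Serial max-pooling approximation of Spatial Pyramid Pooling.

    With k = 5 and stride 1, one, two, and three consecutive poolings cover
    receptive fields of roughly 5x5, 9x9, and 13x13, matching the parallel
    kernels of classic SPP at a fraction of the cost.
    """
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2                      # channel reduction before pooling
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)                           # 1x1 channel reduction
        y1 = self.pool(x)                         # ~5x5 receptive field
        y2 = self.pool(y1)                        # ~9x9
        y3 = self.pool(y2)                        # ~13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```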
2.2. Feature Extraction and Complex Object Optimization
The design of FPN enables modern object detectors to extract features at multiple scales, which is vital for detecting pollen grains of different sizes and locations. However, the limited resolution of microscopic images often results in objects that are minuscule with indistinct edges. Relying solely on standard multi-scale feature fusion is often insufficient for achieving precise detection in such contexts [10]. Consequently, enhanced strategies tailored to specific object characteristics are necessary. The identification and classification of pollen grains depend on both the boundary delimiting the grain from its environment and its internal features, making it crucial to enhance the model's representation capability for indistinct objects and edge details.
For specialized tasks, introducing prior knowledge as a strong inductive bias can significantly improve model performance and convergence speed. In the recognition of objects like pollen grains, precise boundary information is critical. The SobelConv module facilitates this by explicitly injecting first-order gradient information—i.e., edge features—into the network. A SobelConv module is essentially a fixed-weight 2D convolutional layer. Its kernel weights are initialized with the classic Sobel operators for computing horizontal and vertical image gradients [29]. After initialization, these weights are frozen to ensure operational consistency. Although one could in principle relax this constraint and let the gradient kernels be learned end-to-end, we intentionally keep the Sobel filters fixed to provide a stable and interpretable edge prior while keeping the number of trainable parameters in the low-level edge extractor minimal. In the HEM, the subsequent convolutions and fusion modules (SEF and CSPOKM) remain fully learnable, so the network can still adaptively reweight and combine these edge responses with semantic cues. This design follows the common practice of embedding hand-crafted operators as inductive bias in modern CNNs.
In our design, SobelConv is intentionally implemented as a fixed, non-learnable operator that injects a stable and interpretable edge prior into the earliest feature extraction stage. This follows the long tradition of using small discrete gradient masks such as the Sobel operator as efficient and robust first-order edge detectors in image analysis [30]. Our choice is motivated by three considerations: (1) it keeps the low-level edge extractor lightweight in terms of trainable parameters, which is beneficial for compact detectors; (2) it tends to improve optimization stability under the long-tailed and relatively limited pollen data distribution; and (3) Sobel filters have been successfully used to highlight salient intensity transitions along particle and pollen grain boundaries in bright-field and electron microscopy, including automated analyses of grass pollen surface ornamentation and bright-field pollen images [31,32,33]. At the same time, we explicitly acknowledge that this constitutes a strong inductive bias and may limit the ability of HEM to capture atypical edge patterns that deviate substantially from Sobel-like gradients, especially for rare or highly irregular pollen morphologies. In this work, we rely on the subsequent learnable convolutions in HEM and the downstream SEF and CSPOKM modules to adapt these fixed edge responses to task-specific structures, and we leave a systematic ablation comparing fixed SobelConv with a simple learnable convolution as an interesting direction for future work.
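As a concrete illustration of this design, the sketch below implements a fixed-weight SobelConv as a depth-wise PyTorch convolution; the combination of the horizontal and vertical responses into a per-channel gradient magnitude is our own simplifying assumption, and the class name and interface are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class SobelConv(nn.Module):
    """Fixed-weight depth-wise convolution that injects first-order edge priors.

    Each input channel is filtered with the horizontal and vertical Sobel
    kernels; the two responses are combined into a gradient magnitude so the
    output keeps the input's spatial resolution and channel count.
    """
    def __init__(self, channels: int):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        weight = torch.stack([gx, gy]).unsqueeze(1)          # (2, 1, 3, 3)
        weight = weight.repeat(channels, 1, 1, 1)            # (2C, 1, 3, 3), pattern [gx, gy, ...]
        self.conv = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                              groups=channels, bias=False)
        self.conv.weight.data.copy_(weight)
        self.conv.weight.requires_grad_(False)               # edge prior stays frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.conv(x)                                     # (B, 2C, H, W)
        gx, gy = g[:, 0::2], g[:, 1::2]                      # per-channel horizontal / vertical gradients
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)          # gradient magnitude, (B, C, H, W)
```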
In CNNs, standard downsampling operations such as strided convolutions or pooling layers inevitably cause spatial information loss, which is particularly detrimental to the detection and structural analysis of fine-grained objects like pollen. To achieve lossless feature map downsampling, the Space-to-Depth Convolution (SPD-Conv) module was proposed [34]. The core of this module is the "Space-to-Depth" transformation, also known as "Pixel Unshuffle." This operation rearranges a spatial block of 2 × 2 pixels into the channel dimension, effectively halving the spatial resolution while quadrupling the channel depth, thus achieving downsampling without information loss.
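A minimal sketch of this operation in PyTorch, assuming a 3 × 3 convolution follows the pixel-unshuffle step, is given below; the names are illustrative.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-Depth downsampling followed by a convolution.

    A 2x2 spatial block is folded into the channel dimension (pixel unshuffle),
    halving H and W while quadrupling C, so no pixels are discarded before the
    subsequent convolution mixes them.
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(downscale_factor=2)  # (B, C, H, W) -> (B, 4C, H/2, W/2)
        self.conv = nn.Sequential(nn.Conv2d(4 * c_in, c_out, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.unshuffle(x))
```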
Traditional convolutional networks rely on a homogeneous paradigm of stacking small-kernel convolutions. To mine and transform input features from omnidirectional, multi-dimensional, and multi-domain perspectives, the Omni-Kernel core operator was developed [35,36]. The core of this module simulates diverse receptive fields through a set of parallel and heterogeneous depth-wise separable convolutional kernels. Within the Omni-Kernel, multi-scale and anisotropic convolutions employ various kernel shapes in parallel [37], including a large square kernel (e.g., 31 × 31) for capturing broad context and multiple anisotropic strip kernels (e.g., 1 × 31 and 31 × 1). These anisotropic kernels are particularly effective for perceiving structures with specific orientations, such as pollen contours or textures, thus providing an "omnidirectional" perceptual capability. The Omni-Kernel also integrates attention mechanisms from both the spatial and frequency domains. Frequency Channel Attention (FCA) first learns channel attention weights in the spatial domain via global average pooling and 1 × 1 convolutions, then applies these weights to the Fourier spectrum of the feature map [38]. This allows the model to dynamically adjust channel importance based on frequency composition. On the features processed by FCA, Spatial Channel Attention (SCA) further applies a classic Squeeze-and-Excitation type of channel attention for secondary calibration in the spatial domain [39]. The Frequency Gating Module (FGM) acts as a dynamic frequency-domain filter, selectively enhancing or suppressing specific frequency components via a learned gating mechanism, further improving the model's adaptability to complex patterns [40]. The Omni-Kernel design is intended to enhance the expressive power for complex pollen morphologies, improve discrimination of subtle inter-class differences, and increase localization accuracy amidst complex backgrounds and ambiguous boundaries.
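As a simplified sketch of the convolutional part of this operator, the module below runs a large square depth-wise kernel in parallel with horizontal and vertical strip kernels and sums their responses; the attention and gating components (FCA, SCA, FGM) are deliberately omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class OmniKernelBranches(nn.Module):
    """Parallel depth-wise branches with square and anisotropic strip kernels.

    A large square kernel captures broad context, while 1xK and Kx1 strips
    respond to elongated, oriented structures such as pollen contours. The
    frequency/channel attention parts of the full Omni-Kernel are omitted.
    """
    def __init__(self, channels: int, k: int = 31):
        super().__init__()
        self.square = nn.Conv2d(channels, channels, k, padding=k // 2,
                                groups=channels, bias=False)
        self.strip_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                                 groups=channels, bias=False)
        self.strip_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                                 groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the complementary receptive fields; the identity path keeps local detail.
        return x + self.square(x) + self.strip_h(x) + self.strip_v(x)
```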
2.3. Datasets for Pollen Recognition
The development of robust, automated pollen recognition systems is critically dependent on large-scale, high-quality, and meticulously annotated datasets. Publicly available resources for pollen analysis, summarized in Table 1, can be broadly categorized into two main types: those comprising pre-segmented, single-grain images designed for classification tasks, and those providing fully annotated microscopic scenes for object detection. These datasets collectively offer a diverse range of pollen types, imaging conditions, and annotation complexities, providing a comprehensive foundation for benchmarking. This study leverages a selection of these public datasets, aggregated into a unified large-scale benchmark comprising 44,471 images and 342,706 annotated instances, to evaluate the performance of the HieraEdgeNet architecture, aiming to advance the accuracy, robustness, and usability of automated pollen recognition. Taken together, these architectural and dataset preliminaries position HieraEdgeNet between efficient YOLO-style one-stage detectors and detection Transformers such as RT-DETR: our model retains the favorable inductive biases and efficiency of CNN-based backbones while introducing explicit multi-scale edge priors and Omni-Kernel-based feature refinement tailored to the microscopic pollen detection problem.
To address the challenges inherent in detecting pollen grains—characterized by their minute size, limited effective pixel count, and the critical reliance of precise localization on the effective differentiation of target edges from complex environmental backgrounds—we introduce HieraEdgeNet. This novel architecture is engineered to substantially enhance the recognition and localization accuracy of small objects, with a particular focus on pollen grains, by employing hierarchical and synergistic perception of edge information integrated with multi-scale feature fusion. The core innovation of HieraEdgeNet lies in the introduction of three pivotal structural modules: the Hierarchical Edge Module (HEM), the Synergistic Edge Fusion (SEF) module, and the Cross Stage Partial Omni-Kernel Module (CSPOKM) [21]. These modules operate synergistically to form a network highly sensitive to edge information and endowed with robust feature representation capabilities. The architectural details of these three core modules are illustrated in Figure 1.
2.4. Hierarchical Edge Module (HEM) for Multi-Scale Feature Extraction
HEM is engineered to extract and construct a pyramid of multi-scale edge information from input feature maps, thereby explicitly capturing fine-grained image details and salient edge characteristics. We denote the feature map from stage $k$ of the network backbone as $P_k$; it is typically characterized by a spatial resolution downscaled by a factor of $2^k$ relative to the input image (e.g., $P_2$ features are at $1/4$ resolution, $P_3$ at $1/8$, and so forth). The HEM initially employs a fixed Sobel operator, implemented as a convolutional layer with non-learnable weights (SobelConv), to independently compute gradients for each channel of an input feature map $X$. This input is assumed to be from an early backbone stage, such as $P_2$ (i.e., $1/4$ resolution, e.g., with 256 channels), yielding an initial edge response map $E_0$, which preserves the spatial resolution and channel dimensionality of $X$. Subsequently, max-pooling layers (MaxPool2d, kernel size 2 and stride 2) are iteratively applied three times to progressively downsample $E_0$, generating edge maps at successively lower resolutions: $E_1$ at $1/8$, $E_2$ at $1/16$, and $E_3$ at $1/32$ relative to the original image input. Finally, these hierarchically generated edge maps ($E_1$, $E_2$, and $E_3$) are each processed by an independent $1 \times 1$ convolution, which adjusts the channel dimensions to produce the multi-scale edge features $E_{P3}$ (128 channels, derived from $E_1$ at $1/8$ resolution), $E_{P4}$ (256 channels, from $E_2$ at $1/16$ resolution), and $E_{P5}$ (512 channels, from $E_3$ at $1/32$ resolution). These edge features ($E_{P3}$, $E_{P4}$, $E_{P5}$) therefore share the same spatial resolutions as the corresponding backbone stages $P_3$, $P_4$, and $P_5$ (i.e., $1/8$, $1/16$, and $1/32$), which allows SEF to fuse them with the semantic features via straightforward channel-wise concatenation without any additional interpolation or resampling.
Let $X$ denote an input feature map. The HEM performs the following operations:
$$
\begin{aligned}
E_0 &= \mathrm{Sobel}(X),\\
E_i &= \mathrm{MaxPool}_{2\times 2}(E_{i-1}), \quad i = 1, 2, 3,\\
E_{P3} &= \mathrm{Conv}_{1\times 1}^{C_3}(E_1), \qquad
E_{P4} = \mathrm{Conv}_{1\times 1}^{C_4}(E_2), \qquad
E_{P5} = \mathrm{Conv}_{1\times 1}^{C_5}(E_3),
\end{aligned}
$$
where $\mathrm{Sobel}(\cdot)$ denotes the Sobel edge extraction implemented via fixed-weight convolution, $\mathrm{MaxPool}_{2\times 2}(\cdot)$ represents max pooling with a kernel size of 2 and a stride of 2, and $\mathrm{Conv}_{1\times 1}^{C}(\cdot)$ signifies a 1 × 1 convolutional block that adjusts the feature map to $C$ output channels. $E_{P3}$, $E_{P4}$, and $E_{P5}$ are the resulting multi-scale edge feature maps, with $C_3$, $C_4$, and $C_5$ being their respective channel dimensions (128, 256, and 512, as specified above).
The HEM furnishes the backbone network with explicit edge cues at multiple abstraction levels, designed for fusion with corresponding semantic features. This mechanism significantly aids the model in achieving more precise localization of object boundaries and in comprehending fine-grained structural details.
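A minimal PyTorch rendering of these operations is given below; it reuses the SobelConv sketch from Section 2.2, and the module and argument names are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalEdgeModule(nn.Module):
    """Sketch of HEM: Sobel edges on an early feature map, pooled into a pyramid.

    SobelConv is the fixed-weight operator sketched earlier; the 1x1 convolutions
    adjust channel counts to match the backbone stages P3, P4, and P5.
    """
    def __init__(self, c_in: int, c_p3: int = 128, c_p4: int = 256, c_p5: int = 512):
        super().__init__()
        self.sobel = SobelConv(c_in)                      # fixed edge prior, defined above
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.proj3 = nn.Conv2d(c_in, c_p3, 1, bias=False)
        self.proj4 = nn.Conv2d(c_in, c_p4, 1, bias=False)
        self.proj5 = nn.Conv2d(c_in, c_p5, 1, bias=False)

    def forward(self, x):                                 # x: early backbone feature (e.g., 1/4 scale)
        e0 = self.sobel(x)                                # edge response at the input resolution
        e1 = self.pool(e0)                                # 1/8 of the image
        e2 = self.pool(e1)                                # 1/16
        e3 = self.pool(e2)                                # 1/32
        return self.proj3(e1), self.proj4(e2), self.proj5(e3)
```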
2.5. Synergistic Edge Fusion (SEF) for Integrating Priors
SEF is engineered to integrate core semantic information, crucial for pollen classification, with salient edge cues essential for precise localization. SEF operates by concurrently processing semantic features extracted at three primary spatial scales within the network backbone (P3/8, P4/16, and P5/32, corresponding to feature maps downsampled by factors of 8, 16, and 32, respectively). These semantic features are then fused with the corresponding scale-specific edge features generated by the HEM.
Let $F_{\mathrm{sem}}$ denote the input semantic features and $F_{\mathrm{edge}}$ represent the input edge features from the corresponding scale. The symbol $\oplus$ signifies concatenation along the channel dimension. The operations within the SEF module are defined as follows:
$$
F_{\mathrm{out}} = \mathrm{Conv}^{C_{\mathrm{out}}}\!\left(F_{\mathrm{sem}} \oplus F_{\mathrm{edge}}\right),
$$
where $F_{\mathrm{out}}$ is the fused output feature map of the SEF module, $\mathrm{Conv}^{C_{\mathrm{out}}}(\cdot)$ denotes a convolutional block, and $C_{\mathrm{out}}$ specifies the number of output channels for the SEF module. The application of SEF modules at successive stages of the network, each operating on features of decreasing spatial resolution due to backbone downsampling, facilitates a continuous and hierarchical process of feature fusion. This mechanism allows SEF to effectively meld semantic and edge information, thereby substantially enhancing the model's proficiency in both the classification and precise localization of pollen grains.
2.6. Cross Stage Partial Omni-Kernel Module (CSPOKM) for Feature Refinement
The primary objective of the CSPOKM is to enhance the flexibility and expressive power of feature extraction through the strategic incorporation of Omni-Kernel capabilities within a cross-stage, partial processing framework. This module is designed to effectively capture multi-scale features by leveraging inputs from different network stages. Specifically, CSPOKM introduces a learnable ensemble of convolutional kernels via the Omni-Kernel, which are characterized by their spatial sharing and channel-wise distinctions, enabling the model to flexibly combine and adapt feature representations at various levels of abstraction. Within HieraEdgeNet, the CSPOKM embeds the Omni-Kernel—a sophisticated operator comprising diverse large-scale, multi-directional depth-wise convolutions, augmented by attention and gating mechanisms such as SCA, FCA, and FGM—into a CSP architecture. This configuration is employed for feature refinement within a critical path of the detection head, specifically at the P3 level, aiming to enhance the detail and spatial precision of features by integrating multi-scale contextual information.
The operational flow begins by concatenating features from disparate pathways: $F_{\mathrm{spd}}$ (derived from P2/4 features processed by an SPD-Conv layer, designed to reduce spatial dimensions while preserving feature information), $F_{\mathrm{p3}}$ (the fused backbone features at the P3/8 level), and $F_{\mathrm{up}}$ (obtained by upsampling P5/32 features). Let the concatenated input features be $F_{\mathrm{in}}$. The subsequent operations within CSPOKM are detailed as follows:
$$
\begin{aligned}
F_{\mathrm{in}} &= F_{\mathrm{spd}} \oplus F_{\mathrm{p3}} \oplus F_{\mathrm{up}},\\
(F_{\mathrm{main}}, F_{\mathrm{skip}}) &= \mathrm{Split}_{e}(F_{\mathrm{in}}),\\
F_{\mathrm{ok}} &= \mathrm{OK}(F_{\mathrm{main}}),\\
F_{\mathrm{out}} &= \mathrm{Conv}^{C_{\mathrm{out}}}\!\left(F_{\mathrm{ok}} \oplus F_{\mathrm{skip}}\right).
\end{aligned}
$$
Here, $\oplus$ denotes concatenation along the channel dimension. The function $\mathrm{Split}_{e}(\cdot)$ partitions the feature channels into two segments, $F_{\mathrm{main}}$ and $F_{\mathrm{skip}}$, based on a ratio $e$ set empirically in our architecture. The segment $F_{\mathrm{main}}$ is processed by the Omni-Kernel module, denoted as $\mathrm{OK}(\cdot)$, yielding $F_{\mathrm{ok}}$. This output is then concatenated with the skipped connection $F_{\mathrm{skip}}$ and further processed by a convolutional block to produce the final output $F_{\mathrm{out}}$. Typically, to maintain channel consistency for feature refinement at a specific level, the module's input and output channel counts are set equal to a target channel dimension $C_{\mathrm{out}}$, ensuring that they are concordant.
This sophisticated processing, particularly at critical feature levels rich in detail such as P3, endows the network with a powerful feature learning capability. It enables the capture of complex patterns and long-range dependencies that are challenging for conventional convolutional layers. Concurrently, the CSP architecture ensures computational efficiency. Such characteristics are particularly advantageous for improving performance in scenarios involving small objects or requiring fine-grained distinctions between classes.
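The cross-stage partial wrapper can be sketched as follows; the split ratio default of 0.25 is a placeholder (the empirical value is set in the released configuration), and the Omni-Kernel branch is represented by the simplified OmniKernelBranches sketch from Section 2.2.

```python
import torch
import torch.nn as nn

class CSPOmniKernel(nn.Module):
    """Sketch of the cross-stage partial wrapper around an Omni-Kernel operator.

    Only a fraction `e` of the concatenated channels is routed through the
    expensive Omni-Kernel branch; the remainder bypasses it and is re-joined
    by a 1x1 convolution, following the CSP design.
    """
    def __init__(self, c_in: int, c_out: int, e: float = 0.25):
        super().__init__()
        self.c_main = int(c_in * e)                      # channels sent to the Omni-Kernel
        self.omni = OmniKernelBranches(self.c_main)      # stand-in for the full Omni-Kernel
        self.fuse = nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, f_spd, f_p3, f_up):
        f_in = torch.cat((f_spd, f_p3, f_up), dim=1)     # features from P2 (SPD), P3, and upsampled P5
        f_main, f_skip = f_in[:, :self.c_main], f_in[:, self.c_main:]
        f_ok = self.omni(f_main)                         # refined branch
        return self.fuse(torch.cat((f_ok, f_skip), dim=1))
```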
Computationally, the CSPOKM applies the channel split ratio $e$ described above and is employed exclusively at the P3 level to concentrate resources on small-object features. This design keeps the total inference cost at 14 GFLOPs. While higher than the YOLOv12n baseline, this overhead yields a decisive performance boost (mAP@0.5:0.95 of 0.8444, Table 2), outperforming much heavier Transformer-based models like RT-DETR-R18/R50. Additionally, the compressibility demonstrated by the LAMP variant (11.6 GFLOPs) indicates that the architecture effectively balances rich feature extraction with practical deployment constraints.
To make the price–performance trade-off of our design more transparent, we follow common practice in complexity-aware CNN and detector design [37,46] and explicitly profile the FLOP contributions of the three core modules using a fixed evaluation-size input and the profiling tools provided in the Ultralytics framework. Relative to the 6.7 GFLOPs of the YOLOv12n baseline, HieraEdgeNet's total 14.0 GFLOPs can be decomposed as follows: the backbone and detection head account for essentially the same cost as YOLOv12n, HEM and SEF together contribute less than one third of the additional 7.3 GFLOPs, and CSPOKM accounts for the remaining majority. This indicates that most of the extra computation is concentrated in the large-kernel omni-kernel branch at the P3 level, whereas the hierarchical edge extraction (HEM) and fusion (SEF) modules add comparatively modest overhead relative to their impact on accuracy. A detailed per-layer FLOP breakdown and the corresponding profiling scripts are released in our PalynoKit (version 0.0.1) repository to facilitate reproducibility and further analysis by practitioners; all profiling tools and experiments in this study are implemented with this package.
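As a generic illustration of per-module FLOP accounting (not the exact Ultralytics/PalynoKit tooling), one can profile an isolated module with a third-party counter such as thop; the helper below is a sketch under that assumption, and the function name and MAC-to-FLOP convention are our own.

```python
import torch
from thop import profile  # pip install thop; generic operation counter used as a stand-in

def profile_module(module: torch.nn.Module, channels: int, size: int) -> float:
    """Rough per-module GFLOPs estimate on a single feature map of shape
    (1, channels, size, size); useful for attributing cost to HEM, SEF, or CSPOKM."""
    x = torch.randn(1, channels, size, size)
    macs, _ = profile(module, inputs=(x,), verbose=False)
    return 2.0 * macs / 1e9  # one multiply-accumulate counted as two FLOPs
```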
2.7. Architectural Integration of HieraEdgeNet
The HieraEdgeNet architecture is engineered to improve object detection for targets defined by fine-grained boundary information, a challenge prevalent in microscopic imaging. Its core design principle is the parallel and synergistic processing of semantic and edge features. The comprehensive architecture of HieraEdgeNet is depicted in Figure 2.
The implementation involves three key stages. First, a HEM operates in parallel to the backbone’s initial stages, generating a pyramid of edge-maps that corresponds directly to the semantic feature hierarchy. Second, these scale-specific edge features are injected into the primary feature stream via SEF modules, creating a unified set of representations that encode both boundary and semantic information at each scale. Third, the network neck employs a Path Aggregation Network (PANet) architecture to facilitate robust, bidirectional feature aggregation across all levels of the now edge-enhanced pyramid. These deeply fused multi-scale features are then processed by the detection heads to yield final predictions, comprising bounding boxes, confidence scores, and class labels.
2.8. Detection Head and Loss Function
The detection architecture directly employs a decoupled head. During the forward propagation phase, for each detection level $l$, the input feature map is processed independently by the regression and classification heads: the regression head outputs a distance-distribution map $t^{\mathrm{reg}}_{l}$ with $4 \times n_{\mathrm{bins}}$ channels, where $n_{\mathrm{bins}}$ is the number of discrete bins used by the Distribution Focal Loss (DFL), and the classification head outputs a score map $t^{\mathrm{cls}}_{l}$ with $n_c$ channels, where $n_c$ is the number of classes. During inference, predictions from all levels are concatenated and decoded. For each detection layer $l$ and at each spatial location $(i, j)$, let the regression prediction be $t^{\mathrm{reg}}_{l,i,j}$ and the class prediction be $t^{\mathrm{cls}}_{l,i,j}$. For bounding box decoding, $t^{\mathrm{reg}}_{l,i,j}$ is partitioned into four components $(t_t, t_b, t_l, t_r)$, corresponding to the distance distributions for the top, bottom, left, and right sides of the bounding box, respectively. After processing via the mechanism associated with DFL, these yield scalar distances $d_t = \phi(t_t)$, $d_b = \phi(t_b)$, $d_l = \phi(t_l)$, and $d_r = \phi(t_r)$, where $\phi(\cdot)$ represents the operation deriving a scalar distance from the learned distribution. The class probabilities are defined as $p = \sigma(t^{\mathrm{cls}}_{l,i,j})$, where $\sigma(\cdot)$ denotes the sigmoid function.
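As an illustration of the decoding step $\phi(\cdot)$, the helper below converts DFL logits into the four side distances by taking a softmax-weighted expectation over the discrete distance bins; the bin count of 16 shown as the default is a common framework setting and is an assumption here.

```python
import torch
import torch.nn.functional as F

def dfl_decode(reg_logits: torch.Tensor, n_bins: int = 16) -> torch.Tensor:
    """Decode DFL regression logits into four scalar side distances.

    reg_logits: (N, 4 * n_bins) logits for the top/bottom/left/right distance
    distributions at N locations. Each distribution is converted to a scalar by
    the softmax-weighted expectation over the discrete bins 0..n_bins-1.
    """
    n = reg_logits.shape[0]
    logits = reg_logits.view(n, 4, n_bins)                 # (N, 4, n_bins)
    probs = F.softmax(logits, dim=-1)                      # per-side distance distribution
    bins = torch.arange(n_bins, dtype=probs.dtype, device=probs.device)
    return (probs * bins).sum(dim=-1)                      # (N, 4) distances d_t, d_b, d_l, d_r
```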
Addressing the imbalance in pollen samples, this study adopts the Focal Loss ($L_{\mathrm{cls}}$) as the classification loss function. Focal Loss introduces a modulating factor $(1 - p_{t,k})^{\gamma}$ and an optional balancing factor $\alpha_{t,k}$, where $\gamma$ is the focusing parameter. If $p_k$ is the model's predicted probability for the $k$-th class and $y_k$ is the corresponding ground truth label (typically 0 or 1), the Focal Loss for the $k$-th class is:
$$
\mathrm{FL}(p_{t,k}) = -\,\alpha_{t,k}\,(1 - p_{t,k})^{\gamma}\,\log(p_{t,k}),
$$
where $p_{t,k} = p_k$ if $y_k = 1$, and $p_{t,k} = 1 - p_k$ if $y_k = 0$. The total loss for object detection, $L_{\mathrm{total}}$, is a weighted sum of the regression and classification losses, with weights $\lambda_{\mathrm{reg}}$ and $\lambda_{\mathrm{cls}}$, respectively:
$$
L_{\mathrm{total}} = \lambda_{\mathrm{reg}} L_{\mathrm{reg}} + \lambda_{\mathrm{cls}} L_{\mathrm{cls}}.
$$
Here, $L_{\mathrm{cls}}$ is the sum of Focal Losses computed over all positive samples and selected negative samples. The bounding box regression loss ($L_{\mathrm{reg}}$) employs Distribution Focal Loss (DFL) to learn the distribution of distances from the anchor/reference point to the four sides of the bounding box. In all experiments, we combine DFL with Complete IoU (CIoU) loss as implemented in the Ultralytics framework, following recent one-stage detectors that adopt CIoU as the default choice for bounding box regression. Although other IoU variants such as GIoU and SIoU are supported by the underlying framework, we did not perform a systematic ablation over these alternatives in this work and therefore focus our analysis on the CIoU-based configuration. The classification loss ($L_{\mathrm{cls}}$), as previously mentioned, is calculated using Focal Loss.
In practice, we follow the Ultralytics implementation and adopt the default loss gains for the detection task, setting the contributions of the regression and classification components such that the box, classification, and DFL terms are weighted with gains of 7.5, 0.5, and 1.5, respectively. For the detection Transformer baselines that employ Focal Loss within a DETR-style objective, we use standard values for the focusing parameter $\gamma$ and the class-balancing factor $\alpha$, consistent with common practice in long-tailed object detection. We observed that moderate variations around these values do not alter the relative ranking of models, so we report results under this well-established configuration without exhaustively tuning the loss hyperparameters.
In addition to Focal Loss, we also considered several alternative strategies for mitigating the long-tailed class imbalance in our pollen dataset, including simple inverse-frequency class re-weighting, class-balanced loss based on effective numbers of samples, and instance-level over-sampling of rare classes. However, in preliminary experiments these alternatives did not yield consistent improvements in mAP compared to the standard Focal Loss configuration and in some cases slightly degraded training stability. For this reason, and in line with common practice in modern one-stage detectors, we adopt Focal Loss as the final classification loss, while relying on careful dataset construction and augmentation to further alleviate imbalance effects.
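For completeness, the sketch below computes the per-class focal loss defined above in its standard sigmoid form; the defaults $\gamma = 2.0$ and $\alpha = 0.25$ shown here are the values commonly used in the literature, not necessarily the exact settings of every baseline in this study.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Sigmoid focal loss as defined in the text.

    logits: (N, K) raw class scores; targets: (N, K) binary ground-truth labels (float).
    p_t is the probability assigned to the true outcome; (1 - p_t)^gamma down-weights
    easy examples, and alpha balances positive against negative labels.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1.0 - targets) * (1.0 - p)
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```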
3. Results and Discussion
3.1. Dataset Preparation
To facilitate the development and rigorous evaluation of HieraEdgeNet, we constructed a large-scale pollen detection dataset comprising 120 distinct pollen categories. The dataset was aggregated from two primary sources: (1) several existing, manually annotated pollen detection datasets, and (2) a novel data synthesis pipeline designed to convert a vast collection of single-grain classification images into detection samples with precise bounding box annotations. This synthesis strategy involves programmatically embedding individual pollen grain images onto authentic microscopy backgrounds, thereby substantially augmenting both the scale and diversity of the training data. The final dataset exhibits a characteristic long-tailed class distribution (Figure 3), which closely mimics real-world scenarios. This feature is critical for training a high-performance model that is robust in practical applications. For model training, the dataset was partitioned into a training set (80%) and a validation set (20%), maintaining a proportional class distribution based on the number of instances per category. The model was trained for 500 epochs. During training, we employed a suite of data augmentation techniques—including mosaic augmentation, Gaussian blur, and color space shifting—to enhance the model's adaptability and generalization to complex, unseen target domains. In addition, we performed a cross-dataset validation protocol, in which the model is trained on a subset of the aggregated datasets and evaluated on the remaining held-out datasets; the observed performance trends are consistent with those reported on the full benchmark, suggesting that HieraEdgeNet can generalize reasonably well across different pollen imaging environments.
Unless otherwise noted, all detectors considered in this study were trained using the Ultralytics detection framework with a one-cycle learning rate schedule. We set the initial learning rate to 0.01 and annealed it to a much smaller final value over 500 epochs, including 3 warm-up epochs, using stochastic gradient descent with a momentum of 0.9 and weight decay regularization. The training process was executed on a high-performance computing node equipped with four NVIDIA V100 GPUs (NVIDIA Corp., Santa Clara, CA, USA), with a global batch size fixed at 16. Data augmentation followed the default Ultralytics detection pipeline, including HSV color jitter (hue = 0.015, saturation = 0.7, value = 0.4), random scaling in the range [0.5, 1.5], random translation up to 0.1 of the image size, horizontal flipping with probability 0.5, and mosaic composition with probability 1.0, while mixup and copy-paste augmentations were disabled. The complete YAML configuration files and training scripts used for all experiments are released in our open-source repository, which integrates HieraEdgeNet into the Ultralytics training system and allows the reported results to be reproduced with a single training command.
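For readers who wish to reproduce a comparable setup, the call below sketches how such a run could be launched through the Ultralytics Python API; the model and dataset YAML file names are placeholders (the released repository provides the actual files), the image size is shown at the framework's common default, and only hyperparameters stated above are set explicitly.

```python
from ultralytics import YOLO

# Hypothetical file names; substitute the YAMLs shipped with the released repository.
model = YOLO("hieraedgenet.yaml")          # custom architecture registered with the framework
model.train(
    data="pollen.yaml",                    # dataset config listing the 120 pollen classes
    epochs=500, batch=16, imgsz=640,       # imgsz shown at the common default; adjust as needed
    optimizer="SGD", lr0=0.01, momentum=0.9, warmup_epochs=3,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,     # HSV color jitter as described above
    scale=0.5, translate=0.1, fliplr=0.5,  # scale=0.5 corresponds to the [0.5, 1.5] range
    mosaic=1.0, mixup=0.0, copy_paste=0.0, # mosaic on, mixup/copy-paste disabled
)
```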
3.2. Quantitative Evaluation and Benchmarking
To systematically evaluate the performance of our proposed HieraEdgeNet architecture, we conducted a comprehensive set of experiments, benchmarking it against several state-of-the-art (SOTA) real-time object detectors. These baseline models include the CNN-based YOLOv11n [47] and YOLOv12n [14], as well as the Transformer-based RT-DETR-R18 and RT-DETR-R50 [48]. Furthermore, we conducted two ablation studies on HieraEdgeNet. First, we created a hybrid model by integrating its backbone with the detection head of RT-DETR (designated HieraEdgeNet-RT-DETR) to validate the versatility and feature extraction capability of its backbone. Second, we applied layer-adaptive magnitude-based pruning (LAMP), a structured pruning technique [49], to create a compressed variant (HieraEdgeNet-LAMP), aiming to explore its deployment potential in resource-constrained scenarios. Our evaluation employs standard COCO metrics, including mAP@0.5, mAP@0.75, and the primary metric mAP@0.5:0.95, alongside a comprehensive assessment of model parameters, computational complexity (GFLOPs), and inference speed (FPS). During validation and test-time evaluation, all detectors are post-processed using the standard greedy IoU-based NMS provided by the Ultralytics framework. Unless otherwise specified, we set the NMS IoU threshold to 0.7 and use a low base confidence threshold of 0.001 to collect candidate boxes for COCO-style mAP computation and for generating precision–recall and F1 curves. For practical deployment, we recommend using the operating point corresponding to the peak F1 score (confidence threshold 0.43, as shown in Figure 4). We did not enable soft-NMS or other alternative suppression schemes in our experiments, as preliminary checks did not reveal consistent benefits for the spatial densities observed in our pollen dataset.
We evaluate model efficiency using GFLOPs to quantify theoretical computational complexity and FPS to measure practical inference throughput. While a threshold of 30 FPS is typically sufficient for real-time applications, HieraEdgeNet achieves 361.17 FPS—and its pruned variant 403.24 FPS—at the evaluation input resolution on a single NVIDIA V100 GPU (NVIDIA Corp., Santa Clara, CA, USA) (Table 2). Surpassing the real-time benchmark by an order of magnitude, these results confirm that the enhanced feature extraction modules (Omni-Kernel and multi-scale fusion) do not compromise the system's suitability for high-throughput microscopic analysis. We report parameters and GFLOPs as hardware-agnostic indicators of computational cost, together with FPS measured on a fixed reference GPU, and deliberately refrain from summarizing training in terms of a single "GPU-hours" figure, since such a number depends strongly on specific hardware, cluster scheduling, and power-management policies and would not be directly comparable across deployment scenarios.
In practice, we follow the LAMP framework by computing layer-wise sensitivity scores for all convolutional channels and pruning those with the lowest scores under a global sparsity target. The pruning thresholds are tuned on a held-out validation subset to achieve an approximate 17% reduction in computational cost (from 14.0 to 11.6 GFLOPs) while constraining the drop in mAP@0.5:0.95 to less than 1.0 percentage point (0.8444 to 0.8363), thereby balancing compression and accuracy for the HieraEdgeNet-LAMP variant.
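To illustrate the scoring rule underlying layer-adaptive magnitude-based pruning, the function below computes the canonical per-weight LAMP scores for a single layer; in a structured-pruning setting such as ours these scores would be aggregated per channel before thresholding, and the helper name is illustrative.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Canonical LAMP scores for one layer's weight tensor.

    Each weight's squared magnitude is normalised by the total squared magnitude
    of all weights in the same layer that are at least as large, which makes the
    scores comparable across layers for a single global pruning threshold.
    """
    w2 = weight.detach().flatten() ** 2
    sorted_w2, order = torch.sort(w2, descending=True)
    denom = torch.cumsum(sorted_w2, dim=0)        # mass of weights with magnitude >= current
    scores_sorted = sorted_w2 / denom
    scores = torch.empty_like(scores_sorted)
    scores[order] = scores_sorted                 # scatter back to the original weight positions
    return scores.view_as(weight)
```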
The quantitative results, presented in Table 2, clearly demonstrate that our proposed HieraEdgeNet achieves the highest overall detection accuracy among the evaluated models. Its mAP@0.5:0.95 score of 0.8444 surpasses the advanced, similarly sized models YOLOv12n and YOLOv11n by 1.29 and 2.09 percentage points, respectively. This significant accuracy gain provides strong evidence for the effectiveness of the designed Hierarchical Edge Module, Synergistic Edge Fusion, and CSP Omni-Kernel modules in enhancing feature representation. Although this precision enhancement is accompanied by a moderate increase in computational cost (14.0 GFLOPs versus 6.7 for YOLOv12n), this trade-off is highly valuable given the stringent accuracy requirements of the pollen recognition task.
The advantages of HieraEdgeNet are even more pronounced when compared to the Transformer-based RT-DETR models. In terms of accuracy, HieraEdgeNet's mAP@0.5:0.95 outperforms RT-DETR-R18 by 3.7 percentage points and RT-DETR-R50 by 5.48 percentage points. In terms of efficiency, HieraEdgeNet's model size (3.88 M parameters/14 GFLOPs) is substantially lower than that of RT-DETR-R18 (20.03 M/57.5 GFLOPs) and RT-DETR-R50 (42.18 M/126.2 GFLOPs). This indicates that our architecture, through deep structural innovation within a CNN framework, achieves a superior accuracy–efficiency balance compared to leading Transformer-based detectors. When our backbone is integrated with the RT-DETR decoder, the resulting hybrid model (HieraEdgeNet-RT-DETR) achieves an mAP@0.5:0.95 of 0.8492—the highest of all models tested. This result compellingly demonstrates the high quality and generalizability of the features produced by our backbone, capable of providing robust support for diverse detector heads and surpassing the original RT-DETR-R18 and R50. Finally, addressing the practical need for lightweight models, our pruned HieraEdgeNet-LAMP variant achieves a remarkable mAP@0.5:0.95 of 0.8363 with significantly reduced parameters and computation. This accuracy exceeds not only that of the RT-DETR series but also that of both YOLOv11n and YOLOv12n, while maintaining a highly competitive inference speed (403.24 FPS). This showcases the excellent compressibility of the HieraEdgeNet architecture, its superior accuracy–efficiency curve, and its significant potential for practical applications.
The detection performance of HieraEdgeNet was further analyzed, as depicted in Figure 4. The confusion matrix (Figure 4a) exhibits a distinct and highly concentrated diagonal, indicating extremely high classification accuracy across the 46 pollen classes with minimal inter-class confusion. The Precision-Recall (P-R) curve (Figure 4b) further corroborates the model's exceptional performance, achieving a mean Average Precision (mAP) of 0.976 over all classes. The broad area under the curve demonstrates that the model maintains high precision across various recall levels. The F1-score curve (Figure 4c) illustrates the model's comprehensive performance at different confidence thresholds, reaching a peak F1-score of 0.938 at an optimal confidence threshold of 0.43. This provides a reliable basis for selecting the optimal operating point for the model in practical deployments. Baseline detectors are compared quantitatively in Table 2 and Table 3 rather than overlaid in Figure 4, in order to avoid visual clutter and to keep the diagnostic plots focused on the proposed HieraEdgeNet model.
In summary, both the horizontal comparisons against state-of-the-art CNN and Transformer detectors and the vertical analyses of our model variants consistently affirm that the HieraEdgeNet architecture delivers superior precision and efficiency for automated pollen recognition through its deep mining and fusion of edge information and multi-scale features.
3.3. Feature Ablation
To validate the efficacy of the individual innovative components within HieraEdgeNet and to elucidate their interplay, we performed a comprehensive set of ablation studies (Table 3). The results unequivocally demonstrate a profound synergistic effect among the three core modules: HEM, SEF, and CSPOKM. Notably, the isolated introduction of any single module or their partial combinations (Models B, C, and D) failed to outperform the baseline model (Model A).
More specifically, Model B, in which only HEM is enabled, shows that generating the multi-scale edge maps $E_{P3}$, $E_{P4}$, and $E_{P5}$ is insufficient when these edge priors cannot be systematically fused into the semantic hierarchy due to the absence of SEF. In this configuration, the additional edge responses remain only weakly coupled to task-specific semantic features and may even over-emphasize background boundaries, so the detector is unable to convert them into consistent gains in mAP. Model D activates only CSPOKM on top of the baseline. Although the omni-kernel convolutions enrich the receptive field and refine local details, they operate on backbone representations that do not encode explicit edge–semantic fusion; consequently, the module primarily redistributes existing responses instead of introducing new boundary cues, which explains its limited standalone impact. SEF itself is designed as a fusion operator that aligns and reweights the semantic and edge streams. When all three modules are present in Model E, SEF projects the multi-scale edge priors extracted by HEM into the backbone semantic space and provides edge-aware inputs to CSPOKM, enabling the complete HieraEdgeNet architecture to fully exploit boundary information and achieve the substantial performance gain reported in Table 3.
Conversely, only when all three modules operate in concert does the complete HieraEdgeNet architecture (Model E) exhibit a substantial performance gain, achieving an mAP@0.5:0.95 of 0.8444. This result markedly outperforms both the baseline (0.8315) and the state-of-the-art RT-DETR model (0.7896), thereby validating the holistic and innovative design of our proposed architecture.
3.4. Visual Analysis of Enhanced Edge Perception
Figure 5 presents a comparative visualization using Gradient-weighted Class Activation Mapping (Grad-CAM) [50] heatmaps for four models—HieraEdgeNet, YOLOv12n, HieraEdgeNet-RT-DETR, and RT-DETR-R50—at both the backbone's output layer and the detection head's input layer. These heatmaps visually render the model's attention intensity across the input image, with a specific focus on the edges and structural details of pollen grains.
Our analysis reveals that HieraEdgeNet generates more refined and sharply focused responses along the object boundaries, reflecting the HEM’s effective capture of edge information. This advantage is further amplified at the input of the detection head, where the representation of target edges is enhanced. In contrast, models like YOLOv12n and RT-DETR-R50 do not exhibit a comparable level of edge sensitivity, particularly at the detection head’s input layer, where their edge responses appear more diffuse. This suggests a deficiency in their explicit utilization of edge information, which may consequently limit their localization accuracy. In summary, the performance superiority of HieraEdgeNet is rooted in its ability to generate more precise and less noisy feature maps—a capability that is crucial for precise object localization within complex microscopic environments.
4. Conclusions
In conclusion, HieraEdgeNet establishes a new benchmark for automated pollen recognition. Beyond achieving a state-of-the-art mean Average Precision (mAP), our framework demonstrates significant computational efficiency. Notably, by strategically enhancing features with edge priors rather than relying on the computationally intensive self-attention mechanisms of Transformer-based models, HieraEdgeNet maintains high accuracy while reducing inference time, making it a viable solution for real-world, high-throughput applications. The measured throughput of over 360 FPS for the full model and over 400 FPS for the pruned variant at the evaluation input resolution demonstrates that, on modern GPUs, HieraEdgeNet can comfortably satisfy real-time requirements in practical microscopic imaging pipelines. The efficacy of HieraEdgeNet is empirically substantiated by extensive experiments. In benchmark comparisons against state-of-the-art real-time object detectors, including YOLOv12n and RT-DETR, HieraEdgeNet achieved superior performance, significantly surpassing all baseline models on the critical mAP@0.5:0.95 metric. Furthermore, the HieraEdgeNet backbone demonstrated remarkable generalization capabilities when integrated with the RT-DETR decoder. Moreover, a structurally pruned version of the model retained high accuracy while exhibiting outstanding potential for practical deployment. Qualitative analysis using Grad-CAM visualizations further confirmed that our architecture generates feature responses that are more precisely focused and localized to object boundaries.
Despite its demonstrated high accuracy, HieraEdgeNet has limitations, primarily stemming from the substantial computational cost introduced by advanced components such as the Omni-Kernel. Additionally, its inherently two-dimensional design precludes the direct utilization of three-dimensional (Z-stack) microscopy data, and its performance may be challenged by domain shifts between training data and real-world samples. Consequently, our future work will focus on three key directions: (1) optimizing computational efficiency through techniques like model compression and acceleration; (2) extending the edge-enhancement framework to three dimensions to process volumetric data; and (3) employing domain adaptation methods to improve the model's generalization and robustness across varied acquisition conditions. While the multi-source dataset construction and extensive data augmentation already provide some robustness to variations in illumination, contrast, and focus across laboratories, we acknowledge that larger domain shifts between training data and real-world deployments remain challenging, motivating our planned exploration of domain adaptation techniques.