1. Introduction
Palynology, the scientific study of pollen grains and other micropaleontological entities, is of critical importance across diverse scientific domains, including paleoclimate reconstruction [1], biodiversity monitoring [2,3], and allergenic disease research [4]. Crucially, it is also a vital tool for modern agricultural science. In this domain, the precise identification and quantification of pollen are fundamental to crop phenotyping, optimizing pollination ecology for yield enhancement, and monitoring the spread of pollen-borne pathogens. The type and concentration of airborne pollen are key indicators for assessing environmental quality, forecasting allergy seasons, and monitoring agricultural productivity [5,6]. However, conventional pollen analysis relies heavily on manual microscopic enumeration by expert palynologists. This process is not only time-consuming and labor-intensive but is also prone to subjective error, severely limiting the throughput and timeliness of data acquisition [7]. In typical monitoring setups, a trained palynologist can only process on the order of tens of slides per day, which is orders of magnitude below what is required for high-temporal-resolution phenological monitoring and large-scale biodiversity assessments. This limitation forms a critical data bottleneck, hindering our ability to monitor rapid ecological processes such as phenological shifts in response to climate change, the spread of invasive species, or real-time changes in plant biodiversity. Consequently, the development of automated, high-throughput, and high-precision pollen identification technologies has become a pressing demand in the field.
The advent of computer vision and deep learning has presented new opportunities for automated pollen identification [8]. Object detection, a fundamental yet critical task in computer vision, aims to identify and localize instances of specific classes within an image [9]. When applied to the analysis of microscopic images, particularly for pollen grain recognition, this task presents unique challenges. These challenges stem from the minute size of pollen grains, their embedding within complex backgrounds composed of dust and debris, their morphological diversity and tendency to aggregate, and their three-dimensional nature introduced by multi-focal plane imaging, which results in disparate appearances and sharpness for the same particle across different focal planes [7,10]. Early automated approaches primarily depended on traditional image processing techniques, such as color similarity transforms [11], active contour models [12], or feature engineering based on shape and texture [13]. Although these methods achieved some success on specific datasets, they often required tedious manual feature design, and their robustness and generalization capabilities were often suboptimal when faced with real-world samples characterized by complex backgrounds, pollen aggregation, or morphological variability [7,10]. Deep learning, particularly the emergence of Convolutional Neural Networks (CNNs), offered a powerful new paradigm to address these issues [8,9]. By automatically learning hierarchical features from data, CNN-based object detection models have achieved revolutionary breakthroughs. In recent years, one-stage detectors, striking an exceptional balance between speed and accuracy, have rapidly become the mainstream approach for real-time object detection [14]. These detectors frame the task as an end-to-end regression problem, thereby enabling real-time performance. In their architectural evolution, multi-scale feature representation is crucial for detecting objects of varying sizes. The Feature Pyramid Network (FPN), which constructs a feature pyramid via top-down pathways and lateral connections to effectively merge high-level semantic information with low-level spatial details, has become an indispensable component of modern detectors [15]. Subsequent architectures combine these pyramid features with post-processing components like Non-Maximum Suppression (NMS) or Transformer-based decoders [16,17] for final detection.
The field of pollen identification and classification has also seen widespread application of these advances, as demonstrated by Kubera et al. [18], who applied the YOLOv5 model to detect three pollen types with high morphological similarity, showing performance superior to other detectors such as Faster R-CNN and RetinaNet. Tan et al. [19] also developed the PollenDetect system based on YOLOv5 for identifying pollen activity. These studies demonstrate the significant potential of deep learning models, especially the YOLO series, in automated palynology. However, the direct application of these general-purpose models to pollen identification is still confronted with distinct challenges. The minute size of pollen grains means they occupy a limited number of effective pixels in microscopic images [7]. Consequently, relying solely on high-level semantic information is insufficient for high-precision recognition, rendering the effective extraction and utilization of boundary information paramount. Although some works have explored edge detection as an auxiliary task [20], how to systematically and synergistically fuse these low-level edge cues with high-level semantic features within the network remains a critical and unresolved issue.
To this end, we design and propose HieraEdgeNet, a multi-scale, edge-enhanced framework specifically engineered for high-precision automated pollen identification. Our architecture builds upon the foundational concepts previously discussed but introduces deep customizations and innovations to address the specific challenges of pollen analysis. The core innovation of HieraEdgeNet lies in three synergistic modules. First, we designed the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale edge feature pyramid in the early stages of the network, corresponding to the semantic hierarchy of the backbone. Subsequently, we introduce the Synergistic Edge Fusion (SEF) module, responsible for the deep, synergistic fusion of these scale-aware edge cues with semantic information from the backbone at each respective scale. Finally, to achieve ultimate feature refinement at the most detail-rich feature level, we adapt and incorporate the Cross Stage Partial Omni-Kernel Module (CSPOKM) [21]. Through these synergistic modules, HieraEdgeNet significantly enhances the detection and localization accuracy of minute, indistinct targets. This technological advancement offers a robust solution for high-throughput palynological analysis, thereby unlocking its full potential for applications in ecology and precision agriculture. Specifically, it provides the technical foundation for generating the large-scale, high-resolution datasets essential for modern biodiversity monitoring, tracking ecosystem responses to climate change, and paleoecological reconstruction, while also supporting precision agriculture through pollination assessment and crop health monitoring, thus contributing to both ecological sustainability and food security. In summary, this work makes three main contributions. First, we propose HieraEdgeNet, a palynology-specific detector that explicitly leverages a hierarchical edge feature pyramid and edge–semantic fusion to address the indistinct boundaries and minute size of microscopic pollen grains. Second, we integrate a Cross Stage Partial Omni-Kernel module that concentrates large-kernel computation at the most detail-rich scale, yielding a favorable accuracy–efficiency trade-off compared with YOLO-based baselines while keeping the overall computational cost moderate. Third, we construct a unified large-scale pollen detection benchmark from multiple public sources and synthetic compositions, and we systematically benchmark HieraEdgeNet against both YOLO and detection Transformer models, thereby providing a rigorous and reproducible evaluation of its performance.
2. Materials and Methods
Because this work targets an interdisciplinary audience, including palynologists, ecologists, and agricultural scientists who may not be familiar with modern object-detection architectures, this section briefly reviews the key modules and datasets on which our design builds before presenting the proposed HieraEdgeNet architecture.
2.1. Efficient Network Modules Based on Cross Stage Partial Ideology
To control computational costs while increasing network depth, efficient network architecture design is paramount. CSPNet (Cross Stage Partial Network) [22] introduced an effective strategy that partitions the feature map entering a processing block along the channel dimension into two parts. One part undergoes deep transformation through the block, while the other is concatenated with the processed features via a shortcut connection. This design reduces computational redundancy and enhances gradient propagation throughout the network, thereby improving model learning capacity and efficiency. In modern object detection architectures, the basic convolution operation is encapsulated within a standardized Conv block. This block conventionally combines a 2D convolution layer, a batch normalization layer, and an activation function in series. This Conv-BN-Act structure has been proven to effectively accelerate model convergence and improve training stability, while also providing a regularization effect [23]. The C3k2 module, a core component for deep feature extraction in our architecture, is based on the design principles of CSPNet and adheres to the C2f paradigm [24]. This module is typically configured with standard Bottleneck blocks, each containing two convolution layers. The output of each processing unit is concatenated with the output of the previous stage, achieving a dense feature aggregation akin to that in DenseNet [25]. Finally, the features from all branches are concatenated and passed through a 1 × 1 convolution for final feature fusion. This design not only minimizes computational redundancy but also facilitates network learning through enriched gradient paths. The Area-Attention C2f (A2C2f) module embeds a region-based attention mechanism (AAttn) into the efficient C2f structure, endowing the model with dynamic, context-aware capabilities while retaining the advantages of CNN's local inductive bias [14]. The core of the A2C2f module is an ABlock, which comprises an AAttn module and a feed-forward network (MLP) with residual connections, structurally similar to a standard Encoder layer in a Vision Transformer (ViT) [26]. The AAttn module implements multi-head self-attention but can constrain the computation to specific regions of the feature map via an area parameter, enabling a trade-off between global attention and computational cost.
To enable the network to capture multi-scale contextual information from a single-scale feature map, the Spatial Pyramid Pooling - Fast (SPPF) module was introduced as an efficient variant [27,28]. It employs a series of max-pooling layers to equivalently simulate the effect of parallel large-kernel pooling. An input feature map first undergoes channel reduction via a 1 × 1 convolution. Subsequently, this reduced feature map is passed serially through a max-pooling layer with a fixed kernel size (e.g., k = 5) and a stride of 1, three consecutive times. The initial channel-reduced feature is then concatenated along the channel dimension with the outputs from the three serial pooling operations. Since two consecutive k × k pooling operations (with stride 1) have a receptive field approximately equivalent to a single (2k − 1) × (2k − 1) kernel, and three operations to a (3k − 2) × (3k − 2) kernel, SPPF with k = 5 serially applied three times can effectively capture receptive fields similar to those from parallel 5 × 5, 9 × 9, and 13 × 13 kernels in traditional SPP. Finally, the concatenated features are fused and their channels are adjusted by another 1 × 1 convolution. This serial design maintains multi-scale context awareness while significantly reducing computational complexity and improving inference speed.
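To make this construction concrete, the sketch below implements a minimal SPPF block in PyTorch consistent with the description above; class and argument names are illustrative and not taken from any released code.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Serial max-pooling approximation of Spatial Pyramid Pooling.

    With k = 5 and stride 1, one, two, and three consecutive poolings cover
    receptive fields of roughly 5x5, 9x9, and 13x13, matching the parallel
    kernels of classic SPP at a fraction of the cost.
    """
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2                      # channel reduction before pooling
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)                           # 1x1 channel reduction
        y1 = self.pool(x)                         # ~5x5 receptive field
        y2 = self.pool(y1)                        # ~9x9
        y3 = self.pool(y2)                        # ~13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```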
2.2. Feature Extraction and Complex Object Optimization
The design of FPN enables modern object detectors to extract features at multiple scales, which is vital for detecting pollen grains of different sizes and locations. However, the limited resolution of microscopic images often results in objects that are minuscule with indistinct edges. Relying solely on standard multi-scale feature fusion is often insufficient for achieving precise detection in such contexts [10]. Consequently, enhanced strategies tailored to specific object characteristics are necessary. The identification and classification of pollen grains depend on both the boundary delimiting the grain from its environment and its internal features, making it crucial to enhance the model's representation capability for indistinct objects and edge details.
For specialized tasks, introducing prior knowledge as a strong inductive bias can significantly improve model performance and convergence speed. In the recognition of objects like pollen grains, precise boundary information is critical. The SobelConv module facilitates this by explicitly injecting first-order gradient information—i.e., edge features—into the network. A SobelConv module is essentially a fixed-weight 2D convolutional layer. Its kernel weights are initialized with the classic Sobel operators for computing horizontal and vertical image gradients [29]. After initialization, these weights are frozen to ensure operational consistency. Although one could in principle relax this constraint and let the gradient kernels be learned end-to-end, we intentionally keep the Sobel filters fixed to provide a stable and interpretable edge prior while keeping the number of trainable parameters in the low-level edge extractor minimal. In the HEM, the subsequent convolutions and fusion modules (SEF and CSPOKM) remain fully learnable, so the network can still adaptively reweight and combine these edge responses with semantic cues. This design follows the common practice of embedding hand-crafted operators as inductive bias in modern CNNs.
In our design, SobelConv is intentionally implemented as a fixed, non-learnable operator that injects a stable and interpretable edge prior into the earliest feature extraction stage. This follows the long tradition of using small discrete gradient masks such as the Sobel operator as efficient and robust first-order edge detectors in image analysis [30]. Our choice is motivated by three considerations: (1) it keeps the low-level edge extractor lightweight in terms of trainable parameters, which is beneficial for compact detectors; (2) it tends to improve optimization stability under the long-tailed and relatively limited pollen data distribution; and (3) Sobel filters have been successfully used to highlight salient intensity transitions along particle and pollen grain boundaries in bright-field and electron microscopy, including automated analyses of grass pollen surface ornamentation and bright-field pollen images [31,32,33]. At the same time, we explicitly acknowledge that this constitutes a strong inductive bias and may limit the ability of HEM to capture atypical edge patterns that deviate substantially from Sobel-like gradients, especially for rare or highly irregular pollen morphologies. In this work, we rely on the subsequent learnable convolutions in HEM and the downstream SEF and CSPOKM modules to adapt these fixed edge responses to task-specific structures, and we leave a systematic ablation comparing fixed SobelConv with a simple learnable convolution as an interesting direction for future work.
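As a concrete illustration of this design, the sketch below implements a fixed-weight SobelConv as a depth-wise PyTorch convolution; the combination of the horizontal and vertical responses into a per-channel gradient magnitude is our own simplifying assumption, and the class name and interface are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class SobelConv(nn.Module):
    """Fixed-weight depth-wise convolution that injects first-order edge priors.

    Each input channel is filtered with the horizontal and vertical Sobel
    kernels; the two responses are combined into a gradient magnitude so the
    output keeps the input's spatial resolution and channel count.
    """
    def __init__(self, channels: int):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        weight = torch.stack([gx, gy]).unsqueeze(1)          # (2, 1, 3, 3)
        weight = weight.repeat(channels, 1, 1, 1)            # (2C, 1, 3, 3), pattern [gx, gy, ...]
        self.conv = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                              groups=channels, bias=False)
        self.conv.weight.data.copy_(weight)
        self.conv.weight.requires_grad_(False)               # edge prior stays frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.conv(x)                                     # (B, 2C, H, W)
        gx, gy = g[:, 0::2], g[:, 1::2]                      # per-channel horizontal / vertical gradients
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)          # gradient magnitude, (B, C, H, W)
```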
In CNNs, standard downsampling operations such as strided convolutions or pooling layers inevitably cause spatial information loss, which is particularly detrimental to the detection and structural analysis of fine-grained objects like pollen. To achieve lossless feature map downsampling, the Space-to-Depth Convolution (SPD-Conv) module was proposed [34]. The core of this module is the "Space-to-Depth" transformation, also known as "Pixel Unshuffle." This operation rearranges a spatial block of 2 × 2 pixels into the channel dimension, effectively halving the spatial resolution while quadrupling the channel depth, thus achieving downsampling without information loss.
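A minimal sketch of this operation in PyTorch, assuming a 3 × 3 convolution follows the pixel-unshuffle step, is given below; the names are illustrative.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-Depth downsampling followed by a convolution.

    A 2x2 spatial block is folded into the channel dimension (pixel unshuffle),
    halving H and W while quadrupling C, so no pixels are discarded before the
    subsequent convolution mixes them.
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(downscale_factor=2)  # (B, C, H, W) -> (B, 4C, H/2, W/2)
        self.conv = nn.Sequential(nn.Conv2d(4 * c_in, c_out, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.unshuffle(x))
```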
Traditional convolutional networks rely on a homogeneous paradigm of stacking small-kernel convolutions. To mine and transform input features from omnidirectional, multi-dimensional, and multi-domain perspectives, the Omni-Kernel core operator was developed [35,36]. The core of this module simulates diverse receptive fields through a set of parallel and heterogeneous depth-wise separable convolutional kernels. Within the Omni-Kernel, multi-scale and anisotropic convolutions employ various kernel shapes in parallel [37], including a large square kernel (e.g., 31 × 31) for capturing broad context and multiple anisotropic strip kernels (e.g., 1 × 31 and 31 × 1). These anisotropic kernels are particularly effective for perceiving structures with specific orientations, such as pollen contours or textures, thus providing an "omnidirectional" perceptual capability. The Omni-Kernel also integrates attention mechanisms from both the spatial and frequency domains. Frequency Channel Attention (FCA) first learns channel attention weights in the spatial domain via global average pooling and 1 × 1 convolutions, then applies these weights to the Fourier spectrum of the feature map [38]. This allows the model to dynamically adjust channel importance based on frequency composition. On the features processed by FCA, Spatial Channel Attention (SCA) further applies a classic Squeeze-and-Excitation type of channel attention for secondary calibration in the spatial domain [39]. The Frequency Gating Module (FGM) acts as a dynamic frequency-domain filter, selectively enhancing or suppressing specific frequency components via a learned gating mechanism, further improving the model's adaptability to complex patterns [40]. The Omni-Kernel design is intended to enhance the expressive power for complex pollen morphologies, improve discrimination of subtle inter-class differences, and increase localization accuracy amidst complex backgrounds and ambiguous boundaries.
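As a simplified sketch of the convolutional part of this operator, the module below runs a large square depth-wise kernel in parallel with horizontal and vertical strip kernels and sums their responses; the attention and gating components (FCA, SCA, FGM) are deliberately omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class OmniKernelBranches(nn.Module):
    """Parallel depth-wise branches with square and anisotropic strip kernels.

    A large square kernel captures broad context, while 1xK and Kx1 strips
    respond to elongated, oriented structures such as pollen contours. The
    frequency/channel attention parts of the full Omni-Kernel are omitted.
    """
    def __init__(self, channels: int, k: int = 31):
        super().__init__()
        self.square = nn.Conv2d(channels, channels, k, padding=k // 2,
                                groups=channels, bias=False)
        self.strip_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                                 groups=channels, bias=False)
        self.strip_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                                 groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the complementary receptive fields; the identity path keeps local detail.
        return x + self.square(x) + self.strip_h(x) + self.strip_v(x)
```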
2.3. Datasets for Pollen Recognition
The development of robust, automated pollen recognition systems is critically dependent on large-scale, high-quality, and meticulously annotated datasets. Publicly available resources for pollen analysis, summarized in Table 1, can be broadly categorized into two main types: those comprising pre-segmented, single-grain images designed for classification tasks, and those providing fully annotated microscopic scenes for object detection. These datasets collectively offer a diverse range of pollen types, imaging conditions, and annotation complexities, providing a comprehensive foundation for benchmarking. This study leverages a selection of these public datasets, aggregated into a unified large-scale benchmark comprising 44,471 images and 342,706 annotated instances, to evaluate the performance of the HieraEdgeNet architecture, aiming to advance the accuracy, robustness, and usability of automated pollen recognition. Taken together, these architectural and dataset preliminaries position HieraEdgeNet between efficient YOLO-style one-stage detectors and detection Transformers such as RT-DETR: our model retains the favorable inductive biases and efficiency of CNN-based backbones while introducing explicit multi-scale edge priors and Omni-Kernel-based feature refinement tailored to the microscopic pollen detection problem.
To address the challenges inherent in detecting pollen grains—characterized by their minute size, limited effective pixel count, and the critical reliance of precise localization on the effective differentiation of target edges from complex environmental backgrounds—we introduce HieraEdgeNet. This novel architecture is engineered to substantially enhance the recognition and localization accuracy of small objects, with a particular focus on pollen grains, by employing hierarchical and synergistic perception of edge information integrated with multi-scale feature fusion. The core innovation of HieraEdgeNet lies in the introduction of three pivotal structural modules: the Hierarchical Edge Module (HEM), the Synergistic Edge Fusion (SEF) module, and the Cross Stage Partial Omni-Kernel Module (CSPOKM) [21]. These modules operate synergistically to form a network highly sensitive to edge information and endowed with robust feature representation capabilities. The architectural details of these three core modules are illustrated in Figure 1.
2.4. Hierarchical Edge Module (HEM) for Multi-Scale Feature Extraction
HEM is engineered to extract and construct a pyramid of multi-scale edge information from input feature maps, thereby explicitly capturing fine-grained image details and salient edge characteristics. We denote the feature map from stage $k$ of the network backbone as $P_k$; it is typically characterized by a spatial resolution downscaled by a factor of $2^k$ relative to the input image (e.g., $P_2$ features are at $1/4$ resolution, $P_3$ at $1/8$, and so forth). The HEM initially employs a fixed Sobel operator, implemented as a convolutional layer with non-learnable weights (SobelConv), to independently compute gradients for each channel of an input feature map $X$. This input is assumed to be from an early backbone stage, such as $P_2$ (i.e., $1/4$ resolution, e.g., with 256 channels), yielding an initial edge response map $E_0$, which preserves the spatial resolution and channel dimensionality of $X$. Subsequently, max-pooling layers (MaxPool2d, kernel size 2 and stride 2) are iteratively applied three times to progressively downsample $E_0$, generating edge maps at successively lower resolutions: $E_1$ at $1/8$, $E_2$ at $1/16$, and $E_3$ at $1/32$ relative to the original image input. Finally, these hierarchically generated edge maps ($E_1$, $E_2$, and $E_3$) are each processed by an independent $1 \times 1$ convolution, which adjusts the channel dimensions to produce the multi-scale edge features $E_{P3}$ (128 channels, derived from $E_1$ at $1/8$ resolution), $E_{P4}$ (256 channels, from $E_2$ at $1/16$ resolution), and $E_{P5}$ (512 channels, from $E_3$ at $1/32$ resolution). These edge features ($E_{P3}$, $E_{P4}$, $E_{P5}$) therefore share the same spatial resolutions as the corresponding backbone stages $P_3$, $P_4$, and $P_5$ (i.e., $1/8$, $1/16$, and $1/32$), which allows SEF to fuse them with the semantic features via straightforward channel-wise concatenation without any additional interpolation or resampling.
Let $X$ denote an input feature map. The HEM performs the following operations:
$$
\begin{aligned}
E_0 &= \mathrm{Sobel}(X),\\
E_i &= \mathrm{MaxPool}_{2\times 2}(E_{i-1}), \quad i = 1, 2, 3,\\
E_{P3} &= \mathrm{Conv}_{1\times 1}^{C_3}(E_1), \qquad
E_{P4} = \mathrm{Conv}_{1\times 1}^{C_4}(E_2), \qquad
E_{P5} = \mathrm{Conv}_{1\times 1}^{C_5}(E_3),
\end{aligned}
$$
where $\mathrm{Sobel}(\cdot)$ denotes the Sobel edge extraction implemented via fixed-weight convolution, $\mathrm{MaxPool}_{2\times 2}(\cdot)$ represents max pooling with a kernel size of 2 and a stride of 2, and $\mathrm{Conv}_{1\times 1}^{C}(\cdot)$ signifies a 1 × 1 convolutional block that adjusts the feature map to $C$ output channels. $E_{P3}$, $E_{P4}$, and $E_{P5}$ are the resulting multi-scale edge feature maps, with $C_3$, $C_4$, and $C_5$ being their respective channel dimensions (128, 256, and 512, as specified above).
The HEM furnishes the backbone network with explicit edge cues at multiple abstraction levels, designed for fusion with corresponding semantic features. This mechanism significantly aids the model in achieving more precise localization of object boundaries and in comprehending fine-grained structural details.
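A minimal PyTorch rendering of these operations is given below; it reuses the SobelConv sketch from Section 2.2, and the module and argument names are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalEdgeModule(nn.Module):
    """Sketch of HEM: Sobel edges on an early feature map, pooled into a pyramid.

    SobelConv is the fixed-weight operator sketched earlier; the 1x1 convolutions
    adjust channel counts to match the backbone stages P3, P4, and P5.
    """
    def __init__(self, c_in: int, c_p3: int = 128, c_p4: int = 256, c_p5: int = 512):
        super().__init__()
        self.sobel = SobelConv(c_in)                      # fixed edge prior, defined above
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.proj3 = nn.Conv2d(c_in, c_p3, 1, bias=False)
        self.proj4 = nn.Conv2d(c_in, c_p4, 1, bias=False)
        self.proj5 = nn.Conv2d(c_in, c_p5, 1, bias=False)

    def forward(self, x):                                 # x: early backbone feature (e.g., 1/4 scale)
        e0 = self.sobel(x)                                # edge response at the input resolution
        e1 = self.pool(e0)                                # 1/8 of the image
        e2 = self.pool(e1)                                # 1/16
        e3 = self.pool(e2)                                # 1/32
        return self.proj3(e1), self.proj4(e2), self.proj5(e3)
```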
2.5. Synergistic Edge Fusion (SEF) for Integrating Priors
SEF is engineered to integrate core semantic information, crucial for pollen classification, with salient edge cues essential for precise localization. SEF operates by concurrently processing semantic features extracted at three primary spatial scales within the network backbone (P3/8, P4/16, and P5/32, corresponding to feature maps downsampled by factors of 8, 16, and 32, respectively). These semantic features are then fused with the corresponding scale-specific edge features generated by the HEM.
Let $F_{\mathrm{sem}}$ denote the input semantic features and $F_{\mathrm{edge}}$ represent the input edge features from the corresponding scale. The symbol $\oplus$ signifies concatenation along the channel dimension. The operations within the SEF module are defined as follows:
$$
F_{\mathrm{out}} = \mathrm{Conv}^{C_{\mathrm{out}}}\!\left(F_{\mathrm{sem}} \oplus F_{\mathrm{edge}}\right),
$$
where $F_{\mathrm{out}}$ is the fused output feature map of the SEF module, $\mathrm{Conv}^{C_{\mathrm{out}}}(\cdot)$ denotes a convolutional block, and $C_{\mathrm{out}}$ specifies the number of output channels for the SEF module. The application of SEF modules at successive stages of the network, each operating on features of decreasing spatial resolution due to backbone downsampling, facilitates a continuous and hierarchical process of feature fusion. This mechanism allows SEF to effectively meld semantic and edge information, thereby substantially enhancing the model's proficiency in both the classification and precise localization of pollen grains.
2.6. Cross Stage Partial Omni-Kernel Module (CSPOKM) for Feature Refinement
The primary objective of the CSPOKM is to enhance the flexibility and expressive power of feature extraction through the strategic incorporation of Omni-Kernel capabilities within a cross-stage, partial processing framework. This module is designed to effectively capture multi-scale features by leveraging inputs from different network stages. Specifically, CSPOKM introduces a learnable ensemble of convolutional kernels via the Omni-Kernel, which are characterized by their spatial sharing and channel-wise distinctions, enabling the model to flexibly combine and adapt feature representations at various levels of abstraction. Within HieraEdgeNet, the CSPOKM embeds the Omni-Kernel—a sophisticated operator comprising diverse large-scale, multi-directional depth-wise convolutions, augmented by attention and gating mechanisms such as SCA, FCA, and FGM—into a CSP architecture. This configuration is employed for feature refinement within a critical path of the detection head, specifically at the P3 level, aiming to enhance the detail and spatial precision of features by integrating multi-scale contextual information.
The operational flow begins by concatenating features from disparate pathways: $F_{\mathrm{spd}}$ (derived from P2/4 features processed by an SPD-Conv layer, designed to reduce spatial dimensions while preserving feature information), $F_{\mathrm{p3}}$ (the fused backbone features at the P3/8 level), and $F_{\mathrm{up}}$ (obtained by upsampling P5/32 features). Let the concatenated input features be $F_{\mathrm{in}}$. The subsequent operations within CSPOKM are detailed as follows:
$$
\begin{aligned}
F_{\mathrm{in}} &= F_{\mathrm{spd}} \oplus F_{\mathrm{p3}} \oplus F_{\mathrm{up}},\\
(F_{\mathrm{main}}, F_{\mathrm{skip}}) &= \mathrm{Split}_{e}(F_{\mathrm{in}}),\\
F_{\mathrm{ok}} &= \mathrm{OK}(F_{\mathrm{main}}),\\
F_{\mathrm{out}} &= \mathrm{Conv}^{C_{\mathrm{out}}}\!\left(F_{\mathrm{ok}} \oplus F_{\mathrm{skip}}\right).
\end{aligned}
$$
Here, $\oplus$ denotes concatenation along the channel dimension. The function $\mathrm{Split}_{e}(\cdot)$ partitions the feature channels into two segments, $F_{\mathrm{main}}$ and $F_{\mathrm{skip}}$, based on a ratio $e$ set empirically in our architecture. The segment $F_{\mathrm{main}}$ is processed by the Omni-Kernel module, denoted as $\mathrm{OK}(\cdot)$, yielding $F_{\mathrm{ok}}$. This output is then concatenated with the skipped connection $F_{\mathrm{skip}}$ and further processed by a convolutional block to produce the final output $F_{\mathrm{out}}$. Typically, to maintain channel consistency for feature refinement at a specific level, the module's input and output channel counts are set equal to a target channel dimension $C_{\mathrm{out}}$, ensuring that they are concordant.
This sophisticated processing, particularly at critical feature levels rich in detail such as P3, endows the network with a powerful feature learning capability. It enables the capture of complex patterns and long-range dependencies that are challenging for conventional convolutional layers. Concurrently, the CSP architecture ensures computational efficiency. Such characteristics are particularly advantageous for improving performance in scenarios involving small objects or requiring fine-grained distinctions between classes.
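The cross-stage partial wrapper can be sketched as follows; the split ratio default of 0.25 is a placeholder (the empirical value is set in the released configuration), and the Omni-Kernel branch is represented by the simplified OmniKernelBranches sketch from Section 2.2.

```python
import torch
import torch.nn as nn

class CSPOmniKernel(nn.Module):
    """Sketch of the cross-stage partial wrapper around an Omni-Kernel operator.

    Only a fraction `e` of the concatenated channels is routed through the
    expensive Omni-Kernel branch; the remainder bypasses it and is re-joined
    by a 1x1 convolution, following the CSP design.
    """
    def __init__(self, c_in: int, c_out: int, e: float = 0.25):
        super().__init__()
        self.c_main = int(c_in * e)                      # channels sent to the Omni-Kernel
        self.omni = OmniKernelBranches(self.c_main)      # stand-in for the full Omni-Kernel
        self.fuse = nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                                  nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, f_spd, f_p3, f_up):
        f_in = torch.cat((f_spd, f_p3, f_up), dim=1)     # features from P2 (SPD), P3, and upsampled P5
        f_main, f_skip = f_in[:, :self.c_main], f_in[:, self.c_main:]
        f_ok = self.omni(f_main)                         # refined branch
        return self.fuse(torch.cat((f_ok, f_skip), dim=1))
```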
Computationally, the CSPOKM applies the channel split ratio $e$ described above and is employed exclusively at the P3 level to concentrate resources on small-object features. This design keeps the total inference cost at 14 GFLOPs. While higher than the YOLOv12n baseline, this overhead yields a decisive performance boost (mAP@0.5:0.95 of 0.8444, Table 2), outperforming much heavier Transformer-based models like RT-DETR-R18/R50. Additionally, the compressibility demonstrated by the LAMP variant (11.6 GFLOPs) indicates that the architecture effectively balances rich feature extraction with practical deployment constraints.
To make the price–performance trade-off of our design more transparent, we follow common practice in complexity-aware CNN and detector design [37,46] and explicitly profile the FLOP contributions of the three core modules using a fixed evaluation-size input and the profiling tools provided in the Ultralytics framework. Relative to the 6.7 GFLOPs of the YOLOv12n baseline, HieraEdgeNet's total 14.0 GFLOPs can be decomposed as follows: the backbone and detection head account for essentially the same cost as YOLOv12n, HEM and SEF together contribute less than one third of the additional 7.3 GFLOPs, and CSPOKM accounts for the remaining majority. This indicates that most of the extra computation is concentrated in the large-kernel omni-kernel branch at the P3 level, whereas the hierarchical edge extraction (HEM) and fusion (SEF) modules add comparatively modest overhead relative to their impact on accuracy. A detailed per-layer FLOP breakdown and the corresponding profiling scripts are released in our PalynoKit (version 0.0.1) repository to facilitate reproducibility and further analysis by practitioners; all profiling tools and experiments in this study are implemented with this package.
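As a generic illustration of per-module FLOP accounting (not the exact Ultralytics/PalynoKit tooling), one can profile an isolated module with a third-party counter such as thop; the helper below is a sketch under that assumption, and the function name and MAC-to-FLOP convention are our own.

```python
import torch
from thop import profile  # pip install thop; generic operation counter used as a stand-in

def profile_module(module: torch.nn.Module, channels: int, size: int) -> float:
    """Rough per-module GFLOPs estimate on a single feature map of shape
    (1, channels, size, size); useful for attributing cost to HEM, SEF, or CSPOKM."""
    x = torch.randn(1, channels, size, size)
    macs, _ = profile(module, inputs=(x,), verbose=False)
    return 2.0 * macs / 1e9  # one multiply-accumulate counted as two FLOPs
```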
2.7. Architectural Integration of HieraEdgeNet
The HieraEdgeNet architecture is engineered to improve object detection for targets defined by fine-grained boundary information, a challenge prevalent in microscopic imaging. Its core design principle is the parallel and synergistic processing of semantic and edge features. The comprehensive architecture of HieraEdgeNet is depicted in Figure 2.
The implementation involves three key stages. First, a HEM operates in parallel to the backbone’s initial stages, generating a pyramid of edge-maps that corresponds directly to the semantic feature hierarchy. Second, these scale-specific edge features are injected into the primary feature stream via SEF modules, creating a unified set of representations that encode both boundary and semantic information at each scale. Third, the network neck employs a Path Aggregation Network (PANet) architecture to facilitate robust, bidirectional feature aggregation across all levels of the now edge-enhanced pyramid. These deeply fused multi-scale features are then processed by the detection heads to yield final predictions, comprising bounding boxes, confidence scores, and class labels.
2.8. Detection Head and Loss Function
The detection architecture directly employs a decoupled head. During the forward propagation phase, for each detection level $l$, the input feature map is processed independently by the regression and classification heads: the regression head outputs a distance-distribution map $t^{\mathrm{reg}}_{l}$ with $4 \times n_{\mathrm{bins}}$ channels, where $n_{\mathrm{bins}}$ is the number of discrete bins used by the Distribution Focal Loss (DFL), and the classification head outputs a score map $t^{\mathrm{cls}}_{l}$ with $n_c$ channels, where $n_c$ is the number of classes. During inference, predictions from all levels are concatenated and decoded. For each detection layer $l$ and at each spatial location $(i, j)$, let the regression prediction be $t^{\mathrm{reg}}_{l,i,j}$ and the class prediction be $t^{\mathrm{cls}}_{l,i,j}$. For bounding box decoding, $t^{\mathrm{reg}}_{l,i,j}$ is partitioned into four components $(t_t, t_b, t_l, t_r)$, corresponding to the distance distributions for the top, bottom, left, and right sides of the bounding box, respectively. After processing via the mechanism associated with DFL, these yield scalar distances $d_t = \phi(t_t)$, $d_b = \phi(t_b)$, $d_l = \phi(t_l)$, and $d_r = \phi(t_r)$, where $\phi(\cdot)$ represents the operation deriving a scalar distance from the learned distribution. The class probabilities are defined as $p = \sigma(t^{\mathrm{cls}}_{l,i,j})$, where $\sigma(\cdot)$ denotes the sigmoid function.
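As an illustration of the decoding step $\phi(\cdot)$, the helper below converts DFL logits into the four side distances by taking a softmax-weighted expectation over the discrete distance bins; the bin count of 16 shown as the default is a common framework setting and is an assumption here.

```python
import torch
import torch.nn.functional as F

def dfl_decode(reg_logits: torch.Tensor, n_bins: int = 16) -> torch.Tensor:
    """Decode DFL regression logits into four scalar side distances.

    reg_logits: (N, 4 * n_bins) logits for the top/bottom/left/right distance
    distributions at N locations. Each distribution is converted to a scalar by
    the softmax-weighted expectation over the discrete bins 0..n_bins-1.
    """
    n = reg_logits.shape[0]
    logits = reg_logits.view(n, 4, n_bins)                 # (N, 4, n_bins)
    probs = F.softmax(logits, dim=-1)                      # per-side distance distribution
    bins = torch.arange(n_bins, dtype=probs.dtype, device=probs.device)
    return (probs * bins).sum(dim=-1)                      # (N, 4) distances d_t, d_b, d_l, d_r
```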
Addressing the imbalance in pollen samples, this study adopts the Focal Loss ($L_{\mathrm{cls}}$) as the classification loss function. Focal Loss introduces a modulating factor $(1 - p_{t,k})^{\gamma}$ and an optional balancing factor $\alpha_{t,k}$, where $\gamma$ is the focusing parameter. If $p_k$ is the model's predicted probability for the $k$-th class and $y_k$ is the corresponding ground truth label (typically 0 or 1), the Focal Loss for the $k$-th class is:
$$
\mathrm{FL}(p_{t,k}) = -\,\alpha_{t,k}\,(1 - p_{t,k})^{\gamma}\,\log(p_{t,k}),
$$
where $p_{t,k} = p_k$ if $y_k = 1$, and $p_{t,k} = 1 - p_k$ if $y_k = 0$. The total loss for object detection, $L_{\mathrm{total}}$, is a weighted sum of the regression and classification losses, with weights $\lambda_{\mathrm{reg}}$ and $\lambda_{\mathrm{cls}}$, respectively:
$$
L_{\mathrm{total}} = \lambda_{\mathrm{reg}} L_{\mathrm{reg}} + \lambda_{\mathrm{cls}} L_{\mathrm{cls}}.
$$
Here, $L_{\mathrm{cls}}$ is the sum of Focal Losses computed over all positive samples and selected negative samples. The bounding box regression loss ($L_{\mathrm{reg}}$) employs Distribution Focal Loss (DFL) to learn the distribution of distances from the anchor/reference point to the four sides of the bounding box. In all experiments, we combine DFL with Complete IoU (CIoU) loss as implemented in the Ultralytics framework, following recent one-stage detectors that adopt CIoU as the default choice for bounding box regression. Although other IoU variants such as GIoU and SIoU are supported by the underlying framework, we did not perform a systematic ablation over these alternatives in this work and therefore focus our analysis on the CIoU-based configuration. The classification loss ($L_{\mathrm{cls}}$), as previously mentioned, is calculated using Focal Loss.
In practice, we follow the Ultralytics implementation and adopt the default loss gains for the detection task, setting the contributions of the regression and classification components such that the box, classification, and DFL terms are weighted with gains of 7.5, 0.5, and 1.5, respectively. For the detection Transformer baselines that employ Focal Loss within a DETR-style objective, we use standard values for the focusing parameter $\gamma$ and the class-balancing factor $\alpha$, consistent with common practice in long-tailed object detection. We observed that moderate variations around these values do not alter the relative ranking of models, so we report results under this well-established configuration without exhaustively tuning the loss hyperparameters.
In addition to Focal Loss, we also considered several alternative strategies for mitigating the long-tailed class imbalance in our pollen dataset, including simple inverse-frequency class re-weighting, class-balanced loss based on effective numbers of samples, and instance-level over-sampling of rare classes. However, in preliminary experiments these alternatives did not yield consistent improvements in mAP compared to the standard Focal Loss configuration and in some cases slightly degraded training stability. For this reason, and in line with common practice in modern one-stage detectors, we adopt Focal Loss as the final classification loss, while relying on careful dataset construction and augmentation to further alleviate imbalance effects.
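For completeness, the sketch below computes the per-class focal loss defined above in its standard sigmoid form; the defaults $\gamma = 2.0$ and $\alpha = 0.25$ shown here are the values commonly used in the literature, not necessarily the exact settings of every baseline in this study.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Sigmoid focal loss as defined in the text.

    logits: (N, K) raw class scores; targets: (N, K) binary ground-truth labels (float).
    p_t is the probability assigned to the true outcome; (1 - p_t)^gamma down-weights
    easy examples, and alpha balances positive against negative labels.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1.0 - targets) * (1.0 - p)
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```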
3. Results and Discussion
3.1. Dataset Preparation
To facilitate the development and rigorous evaluation of HieraEdgeNet, we constructed a large-scale pollen detection dataset comprising 120 distinct pollen categories. The dataset was aggregated from two primary sources: (1) several existing, manually annotated pollen detection datasets, and (2) a novel data synthesis pipeline designed to convert a vast collection of single-grain classification images into detection samples with precise bounding box annotations. This synthesis strategy involves programmatically embedding individual pollen grain images onto authentic microscopy backgrounds, thereby substantially augmenting both the scale and diversity of the training data. The final dataset exhibits a characteristic long-tailed class distribution (Figure 3), which closely mimics real-world scenarios. This feature is critical for training a high-performance model that is robust in practical applications. For model training, the dataset was partitioned into a training set (80%) and a validation set (20%), maintaining a proportional class distribution based on the number of instances per category. The model was trained for 500 epochs. During training, we employed a suite of data augmentation techniques—including mosaic augmentation, Gaussian blur, and color space shifting—to enhance the model's adaptability and generalization to complex, unseen target domains. In addition, we performed a cross-dataset validation protocol, in which the model is trained on a subset of the aggregated datasets and evaluated on the remaining held-out datasets; the observed performance trends are consistent with those reported on the full benchmark, suggesting that HieraEdgeNet can generalize reasonably well across different pollen imaging environments.
Unless otherwise noted, all detectors considered in this study were trained using the Ultralytics detection framework with a one-cycle learning rate schedule. We set the initial learning rate to 0.01 and annealed it to a much smaller final value over 500 epochs, including 3 warm-up epochs, using stochastic gradient descent with a momentum of 0.9 and weight decay regularization. The training process was executed on a high-performance computing node equipped with four NVIDIA V100 GPUs (NVIDIA Corp., Santa Clara, CA, USA), with a global batch size fixed at 16. Data augmentation followed the default Ultralytics detection pipeline, including HSV color jitter (hue = 0.015, saturation = 0.7, value = 0.4), random scaling in the range [0.5, 1.5], random translation up to 0.1 of the image size, horizontal flipping with probability 0.5, and mosaic composition with probability 1.0, while mixup and copy-paste augmentations were disabled. The complete YAML configuration files and training scripts used for all experiments are released in our open-source repository, which integrates HieraEdgeNet into the Ultralytics training system and allows the reported results to be reproduced with a single training command.
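For readers who wish to reproduce a comparable setup, the call below sketches how such a run could be launched through the Ultralytics Python API; the model and dataset YAML file names are placeholders (the released repository provides the actual files), the image size is shown at the framework's common default, and only hyperparameters stated above are set explicitly.

```python
from ultralytics import YOLO

# Hypothetical file names; substitute the YAMLs shipped with the released repository.
model = YOLO("hieraedgenet.yaml")          # custom architecture registered with the framework
model.train(
    data="pollen.yaml",                    # dataset config listing the 120 pollen classes
    epochs=500, batch=16, imgsz=640,       # imgsz shown at the common default; adjust as needed
    optimizer="SGD", lr0=0.01, momentum=0.9, warmup_epochs=3,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,     # HSV color jitter as described above
    scale=0.5, translate=0.1, fliplr=0.5,  # scale=0.5 corresponds to the [0.5, 1.5] range
    mosaic=1.0, mixup=0.0, copy_paste=0.0, # mosaic on, mixup/copy-paste disabled
)
```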
3.2. Quantitative Evaluation and Benchmarking
To systematically evaluate the performance of our proposed HieraEdgeNet architecture, we conducted a comprehensive set of experiments, benchmarking it against several state-of-the-art (SOTA) real-time object detectors. These baseline models include the CNN-based YOLOv11n [47] and YOLOv12n [14], as well as the Transformer-based RT-DETR-R18 and RT-DETR-R50 [48]. Furthermore, we conducted two ablation studies on HieraEdgeNet. First, we created a hybrid model by integrating its backbone with the detection head of RT-DETR (designated HieraEdgeNet-RT-DETR) to validate the versatility and feature extraction capability of its backbone. Second, we applied layer-adaptive magnitude-based pruning (LAMP), a structured pruning technique [49], to create a compressed variant (HieraEdgeNet-LAMP), aiming to explore its deployment potential in resource-constrained scenarios. Our evaluation employs standard COCO metrics, including mAP@0.5, mAP@0.75, and the primary metric mAP@0.5:0.95, alongside a comprehensive assessment of model parameters, computational complexity (GFLOPs), and inference speed (FPS). During validation and test-time evaluation, all detectors are post-processed using the standard greedy IoU-based NMS provided by the Ultralytics framework. Unless otherwise specified, we set the NMS IoU threshold to 0.7 and use a low base confidence threshold of 0.001 to collect candidate boxes for COCO-style mAP computation and for generating precision–recall and F1 curves. For practical deployment, we recommend using the operating point corresponding to the peak F1 score (confidence threshold 0.43, as shown in Figure 4). We did not enable soft-NMS or other alternative suppression schemes in our experiments, as preliminary checks did not reveal consistent benefits for the spatial densities observed in our pollen dataset.
We evaluate model efficiency using GFLOPs to quantify theoretical computational complexity and FPS to measure practical inference throughput. While a threshold of 30 FPS is typically sufficient for real-time applications, HieraEdgeNet achieves 361.17 FPS—and its pruned variant 403.24 FPS—at the evaluation input resolution on a single NVIDIA V100 GPU (NVIDIA Corp., Santa Clara, CA, USA) (Table 2). Surpassing the real-time benchmark by an order of magnitude, these results confirm that the enhanced feature extraction modules (Omni-Kernel and multi-scale fusion) do not compromise the system's suitability for high-throughput microscopic analysis. We report parameters and GFLOPs as hardware-agnostic indicators of computational cost, together with FPS measured on a fixed reference GPU, and deliberately refrain from summarizing training in terms of a single "GPU-hours" figure, since such a number depends strongly on specific hardware, cluster scheduling, and power-management policies and would not be directly comparable across deployment scenarios.
In practice, we follow the LAMP framework by computing layer-wise sensitivity scores for all convolutional channels and pruning those with the lowest scores under a global sparsity target. The pruning thresholds are tuned on a held-out validation subset to achieve an approximate 17% reduction in computational cost (from 14.0 to 11.6 GFLOPs) while constraining the drop in mAP@0.5:0.95 to less than 1.0 percentage point (0.8444 to 0.8363), thereby balancing compression and accuracy for the HieraEdgeNet-LAMP variant.
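To illustrate the scoring rule underlying layer-adaptive magnitude-based pruning, the function below computes the canonical per-weight LAMP scores for a single layer; in a structured-pruning setting such as ours these scores would be aggregated per channel before thresholding, and the helper name is illustrative.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Canonical LAMP scores for one layer's weight tensor.

    Each weight's squared magnitude is normalised by the total squared magnitude
    of all weights in the same layer that are at least as large, which makes the
    scores comparable across layers for a single global pruning threshold.
    """
    w2 = weight.detach().flatten() ** 2
    sorted_w2, order = torch.sort(w2, descending=True)
    denom = torch.cumsum(sorted_w2, dim=0)        # mass of weights with magnitude >= current
    scores_sorted = sorted_w2 / denom
    scores = torch.empty_like(scores_sorted)
    scores[order] = scores_sorted                 # scatter back to the original weight positions
    return scores.view_as(weight)
```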
The quantitative results, presented in Table 2, clearly demonstrate that our proposed HieraEdgeNet achieves the highest overall detection accuracy among the evaluated models. Its mAP@0.5:0.95 score of 0.8444 surpasses the advanced, similarly sized models YOLOv12n and YOLOv11n by 1.29 and 2.09 percentage points, respectively. This significant accuracy gain provides strong evidence for the effectiveness of the designed Hierarchical Edge Module, Synergistic Edge Fusion, and CSP Omni-Kernel modules in enhancing feature representation. Although this precision enhancement is accompanied by a moderate increase in computational cost (14.0 GFLOPs versus 6.7 for YOLOv12n), this trade-off is highly valuable given the stringent accuracy requirements of the pollen recognition task.
The advantages of HieraEdgeNet are even more pronounced when compared to the Transformer-based RT-DETR models. In terms of accuracy, HieraEdgeNet's mAP@0.5:0.95 outperforms RT-DETR-R18 by 3.7 percentage points and RT-DETR-R50 by 5.48 percentage points. In terms of efficiency, HieraEdgeNet's model size (3.88 M parameters/14 GFLOPs) is substantially lower than that of RT-DETR-R18 (20.03 M/57.5 GFLOPs) and RT-DETR-R50 (42.18 M/126.2 GFLOPs). This indicates that our architecture, through deep structural innovation within a CNN framework, achieves a superior accuracy–efficiency balance compared to leading Transformer-based detectors. When our backbone is integrated with the RT-DETR decoder, the resulting hybrid model (HieraEdgeNet-RT-DETR) achieves an mAP@0.5:0.95 of 0.8492—the highest of all models tested. This result compellingly demonstrates the high quality and generalizability of the features produced by our backbone, capable of providing robust support for diverse detector heads and surpassing the original RT-DETR-R18 and R50. Finally, addressing the practical need for lightweight models, our pruned HieraEdgeNet-LAMP variant achieves a remarkable mAP@0.5:0.95 of 0.8363 with significantly reduced parameters and computation. This accuracy exceeds not only that of the RT-DETR series but also that of both YOLOv11n and YOLOv12n, while maintaining a highly competitive inference speed (403.24 FPS). This showcases the excellent compressibility of the HieraEdgeNet architecture, its superior accuracy–efficiency curve, and its significant potential for practical applications.
The detection performance of HieraEdgeNet was further analyzed, as depicted in Figure 4. The confusion matrix (Figure 4a) exhibits a distinct and highly concentrated diagonal, indicating extremely high classification accuracy across the 46 pollen classes with minimal inter-class confusion. The Precision-Recall (P-R) curve (Figure 4b) further corroborates the model's exceptional performance, achieving a mean Average Precision (mAP) of 0.976 over all classes. The broad area under the curve demonstrates that the model maintains high precision across various recall levels. The F1-score curve (Figure 4c) illustrates the model's comprehensive performance at different confidence thresholds, reaching a peak F1-score of 0.938 at an optimal confidence threshold of 0.43. This provides a reliable basis for selecting the optimal operating point for the model in practical deployments. Baseline detectors are compared quantitatively in Table 2 and Table 3 rather than overlaid in Figure 4, in order to avoid visual clutter and to keep the diagnostic plots focused on the proposed HieraEdgeNet model.
In summary, both the horizontal comparisons against state-of-the-art CNN and Transformer detectors and the vertical analyses of our model variants consistently affirm that the HieraEdgeNet architecture delivers superior precision and efficiency for automated pollen recognition through its deep mining and fusion of edge information and multi-scale features.
3.3. Feature Ablation
To validate the efficacy of the individual innovative components within HieraEdgeNet and to elucidate their interplay, we performed a comprehensive set of ablation studies (Table 3). The results unequivocally demonstrate a profound synergistic effect among the three core modules: HEM, SEF, and CSPOKM. Notably, the isolated introduction of any single module or their partial combinations (Models B, C, and D) failed to outperform the baseline model (Model A).
More specifically, Model B, in which only HEM is enabled, shows that generating the multi-scale edge maps $E_{P3}$, $E_{P4}$, and $E_{P5}$ is insufficient when these edge priors cannot be systematically fused into the semantic hierarchy due to the absence of SEF. In this configuration, the additional edge responses remain only weakly coupled to task-specific semantic features and may even over-emphasize background boundaries, so the detector is unable to convert them into consistent gains in mAP. Model D activates only CSPOKM on top of the baseline. Although the omni-kernel convolutions enrich the receptive field and refine local details, they operate on backbone representations that do not encode explicit edge–semantic fusion; consequently, the module primarily redistributes existing responses instead of introducing new boundary cues, which explains its limited standalone impact. SEF itself is designed as a fusion operator that aligns and reweights the semantic and edge streams. When all three modules are present in Model E, SEF projects the multi-scale edge priors extracted by HEM into the backbone semantic space and provides edge-aware inputs to CSPOKM, enabling the complete HieraEdgeNet architecture to fully exploit boundary information and achieve the substantial performance gain reported in Table 3.
Conversely, only when all three modules operate in concert does the complete HieraEdgeNet architecture (Model E) exhibit a substantial performance gain, achieving an mAP@0.5:0.95 of 0.8444. This result markedly outperforms both the baseline (0.8315) and the state-of-the-art RT-DETR model (0.7896), thereby validating the holistic and innovative design of our proposed architecture.
3.4. Visual Analysis of Enhanced Edge Perception
Figure 5 presents a comparative visualization using Gradient-weighted Class Activation Mapping (Grad-CAM) [50] heatmaps for four models—HieraEdgeNet, YOLOv12n, HieraEdgeNet-RT-DETR, and RT-DETR-R50—at both the backbone's output layer and the detection head's input layer. These heatmaps visually render the model's attention intensity across the input image, with a specific focus on the edges and structural details of pollen grains.
Our analysis reveals that HieraEdgeNet generates more refined and sharply focused responses along the object boundaries, reflecting the HEM’s effective capture of edge information. This advantage is further amplified at the input of the detection head, where the representation of target edges is enhanced. In contrast, models like YOLOv12n and RT-DETR-R50 do not exhibit a comparable level of edge sensitivity, particularly at the detection head’s input layer, where their edge responses appear more diffuse. This suggests a deficiency in their explicit utilization of edge information, which may consequently limit their localization accuracy. In summary, the performance superiority of HieraEdgeNet is rooted in its ability to generate more precise and less noisy feature maps—a capability that is crucial for precise object localization within complex microscopic environments.
4. Conclusions
In conclusion, HieraEdgeNet establishes a new benchmark for automated pollen recognition. Beyond achieving a state-of-the-art mean Average Precision (mAP), our framework demonstrates significant computational efficiency. Notably, by strategically enhancing features with edge priors rather than relying on the computationally intensive self-attention mechanisms of Transformer-based models, HieraEdgeNet maintains high accuracy while reducing inference time, making it a viable solution for real-world, high-throughput applications. The measured throughput of over 360 FPS for the full model and over 400 FPS for the pruned variant at the evaluation input resolution demonstrates that, on modern GPUs, HieraEdgeNet can comfortably satisfy real-time requirements in practical microscopic imaging pipelines. The efficacy of HieraEdgeNet is empirically substantiated by extensive experiments. In benchmark comparisons against state-of-the-art real-time object detectors, including YOLOv12n and RT-DETR, HieraEdgeNet achieved superior performance, significantly surpassing all baseline models on the critical mAP@0.5:0.95 metric. Furthermore, the HieraEdgeNet backbone demonstrated remarkable generalization capabilities when integrated with the RT-DETR decoder. Moreover, a structurally pruned version of the model retained high accuracy while exhibiting outstanding potential for practical deployment. Qualitative analysis using Grad-CAM visualizations further confirmed that our architecture generates feature responses that are more precisely focused and localized to object boundaries.
Despite its demonstrated high accuracy, HieraEdgeNet has limitations, primarily stemming from the substantial computational cost introduced by advanced components such as the Omni-Kernel. Additionally, its inherently two-dimensional design precludes the direct utilization of three-dimensional (Z-stack) microscopy data, and its performance may be challenged by domain shifts between training data and real-world samples. Consequently, our future work will focus on three key directions: (1) optimizing computational efficiency through techniques like model compression and acceleration; (2) extending the edge-enhancement framework to three dimensions to process volumetric data; and (3) employing domain adaptation methods to improve the model's generalization and robustness across varied acquisition conditions. While the multi-source dataset construction and extensive data augmentation already provide some robustness to variations in illumination, contrast, and focus across laboratories, we acknowledge that larger domain shifts between training data and real-world deployments remain challenging, motivating our planned exploration of domain adaptation techniques.