1. Introduction
Bees and other pollinating insects play an essential role in maintaining ecosystem stability and ensuring global food security. Previous studies have shown that approximately one-third of global food production is directly or indirectly dependent on animal-mediated pollination, among which bees make a particularly important contribution [1]. Bee abundance is therefore a key quantitative indicator for evaluating pollination capacity, colony vitality, and ecological contribution. However, under the combined pressures of climate change, land use change, pesticide overuse, habitat degradation, and pathogen invasion, the health status of bee colonies has continued to decline worldwide, leading to increasing concern about the global pollination crisis [2]. To address this challenge, efficient and accurate monitoring of bee colonies is urgently needed. Traditional monitoring methods mainly rely on manual observation and expert experience, which are subjective, time-consuming, labor-intensive, and difficult to reproduce under large-scale field conditions. Therefore, accurate, efficient, and deployable bee counting has become an important technical problem in precision beekeeping and ecological monitoring.
With the rapid development of information technology and artificial intelligence, computer vision and deep learning have become important tools for bee colony monitoring and bee behavior analysis [3,4]. A recent systematic review on precision beekeeping emphasized that the Internet of Things, embedded sensing, and intelligent monitoring technologies are driving the transformation of modern beekeeping, and these technologies have been applied to hive status prediction, intrusion or predator detection, pest and disease monitoring, colony health assessment, and apiary management [5]. Among these applications, bee counting provides fundamental quantitative information for evaluating colony activity, population dynamics, pollination capacity, and management decisions. Benefiting from their strong feature representation capability, deep learning methods have been widely used in bee detection, classification, and behavior recognition [6,7,8]. In addition, visual counting and trajectory analysis techniques provide more reliable data support for colony vitality assessment and ecological monitoring [9,10,11,12]. Density-map-based counting methods estimate a continuous density distribution from point annotations and obtain the total number of targets by integrating the predicted density map during inference [13,14]. Owing to their robustness to occlusion, small targets, and scale variation, density regression methods have become a mainstream paradigm for dense scene counting [15,16]. Nevertheless, two major bottlenecks remain in practical bee counting applications. First, these methods usually require a large number of precise point annotations, resulting in high annotation costs when images contain numerous densely distributed bees. Second, high-accuracy models often have relatively high computational complexity, making them difficult to deploy directly on resource-constrained embedded devices. Therefore, reducing annotation dependence through semi-supervised learning while maintaining real-time counting capability through lightweight model design is crucial for promoting the practical application of bee colony monitoring systems [17,18].
In recent years, semi-supervised learning has been widely adopted in dense object counting because of its ability to exploit unlabeled data and reduce annotation requirements [19,20,21]. For example, in semi-supervised crowd counting, multi-task pseudo-label self-correction has been introduced to improve counting accuracy in dense scenes by constructing pseudo-label generation and self-correction mechanisms [22]. Multi-representation consistency learning has also been used to supervise unlabeled data by enforcing consistency constraints among different density representations, thereby enhancing robustness and generalization ability [23]. In addition, context-modeling-based semi-supervised methods combined with the Mean Teacher framework have been developed to guide pseudo-label generation for unlabeled samples and improve predictions in both sparse and dense regions of complex scenes [24]. Related student–teacher approaches based on exponential moving average (EMA) have also been applied to fish recognition and detection tasks, where the teacher network generates pseudo-labels and the student network is trained with both labeled and unlabeled data [25]. These studies indicate that semi-supervised or weakly supervised learning provides an effective way to mine useful information from unlabeled samples and reduce the dependence on dense annotations [26,27]. However, most existing semi-supervised counting methods are designed for general dense targets such as crowds and vehicles, and they are not specifically optimized for bee counting scenarios. Bees are small, densely clustered, visually similar, and easily affected by occlusion, illumination changes, and complex hive backgrounds. As a result, pseudo-density maps generated by the teacher model may contain noise accumulation and counting bias. Therefore, bee counting tasks still require more reliable pseudo-supervision constraints and task-specific feature representation mechanisms.
Meanwhile, with the development of edge computing and the Internet of Things, deploying intelligent models on embedded devices has become an important trend for achieving real-time and low-power monitoring [28]. Lightweight backbone networks and model-compression strategies can significantly reduce inference latency and energy consumption on resource-constrained platforms. For example, MobileNetV3 provides a mobile-friendly architecture through neural architecture search and hardware-aware design, allowing models to maintain favorable performance on mobile and embedded devices [29]. In real-time object detection, the YOLO series has demonstrated a good balance between accuracy and speed for edge visual tasks [30]. Embedded platforms such as Jetson Nano and Jetson TX2 have also been widely used for low-latency vision applications, including counting and localization tasks [28]. In addition, early-exit and branch-pruning strategies can further improve embedded inference efficiency [28]. In beekeeping scenarios, recent edge device studies have explored tasks such as attention-integrated multi-scale predator detection for stingless bee protection and IoT-enabled honeybee monitoring for Varroa destructor detection using edge computing [31,32]. These studies demonstrate the potential of edge intelligence in precision agriculture and precision beekeeping. However, most existing edge intelligence studies in beekeeping focus on pest detection, predator detection, or hive status monitoring, whereas lightweight deployment for high-density bee counting remains insufficiently investigated.
In summary, existing studies have laid a solid foundation for bee visual monitoring, density-map-based counting, semi-supervised learning, and embedded deployment. However, several challenges remain unresolved. First, density-regression-based bee counting still depends heavily on dense point annotations, which limits its scalability in practical data collection. Second, bee images contain small, clustered, and visually similar targets under complex illumination and hive background conditions, requiring stronger fine-grained and multi-scale feature modeling. Third, practical beekeeping applications require not only accurate counting but also efficient inference on portable embedded devices. To address these gaps, this study proposes M3DANet, a lightweight semi-supervised density regression network for bee colony counting. The methodological novelty of M3DANet lies in its bee-counting-oriented integration of lightweight feature extraction, multi-scale context encoding, attention-guided low-level feature fusion, and confidence-masked teacher–student consistency learning. This design is intended to improve dense small-target representation, reduce dependence on fully labeled training data, and support real-time deployment on portable edge devices.
The main contributions of this study are summarized as follows:
A lightweight semi-supervised density regression network, named M3DANet, is proposed for bee colony counting. The network adopts MobileNetV3-Large as the backbone and introduces a multi-scale context encoding (MSCE) module, in which atrous spatial pyramid pooling (ASPP) is followed by a multi-scale dilated convolution (MSDC) block to enhance contextual representation at different receptive field scales. In addition, an attention-guided low-level fusion (AGLF) module is designed to fuse low-level spatial details with high-level semantic features, thereby improving the representation of small and densely distributed bee targets while maintaining low model complexity and high inference efficiency.
A semi-supervised teacher–student learning framework with confidence region constraints is developed to reduce the dependence on dense manual annotations. The teacher network is updated using exponential moving average (EMA), and the student network is optimized with both labeled and unlabeled samples. Pixel-level density consistency and global count consistency are jointly imposed to constrain pseudo-supervision. In addition, a confidence mask based on high-response regions and a warm-up strategy are introduced to suppress unreliable pseudo-label interference and improve training stability under low labeled data ratios.
A dataset and portable edge-monitoring system tailored for field bee counting applications are established. The dataset contains 586 high-resolution images and 34,869 point annotations, covering representative variations in bee density, shooting distance, illumination, viewing angle, and hive background conditions. A label-consistent preprocessing workflow, including coordinate calibration, size normalization, and Gaussian density map generation, is constructed to ensure consistency across the training, testing, and deployment stages. Furthermore, a self-designed handheld bee-monitoring device is developed, and real-scene deployment tests are conducted to verify the practical feasibility of the proposed method in field beekeeping environments.
The remainder of this paper is organized as follows. Section 2 first presents the overall methodological framework of the proposed M3DANet-based bee colony counting system, and then describes the dataset construction and preprocessing workflow, the M3DANet network architecture, the feature extractor, the multi-scale context encoding module, the attention-guided low-level fusion module, the density regression head, the semi-supervised learning strategy, and the baseline implementation details. Section 3 presents the experimental setup and evaluation metrics, followed by comparisons with representative supervised and semi-supervised methods, structural and semi-supervised component ablation experiments, cross-species generalization experiments, sensitivity analysis of key hyperparameters, edge deployment verification, and failure case analysis. Section 4 discusses the main experimental findings; explains the contributions of the network architecture and semi-supervised learning strategy; analyzes the cross-species generalization results, deployment performance, and remaining limitations; and summarizes the practical significance of the proposed method. Finally, Section 5 concludes the paper and outlines future research directions.
2. Materials and Methods
2.1. Overall Methodological Framework
Figure 1 presents the overall methodological framework of the proposed M3DANet-based bee colony counting system. The framework starts from bee images, point labels, and unlabeled data. During data preprocessing, the input images are resized or cropped, point annotations are converted into Gaussian density maps, and a fixed data split is adopted for training, validation, and testing. The processed samples are then fed into M3DANet, which consists of a MobileNetV3-Large-based lightweight backbone, a multi-scale context encoding module, an attention-guided low-level fusion module, and a density regression head. The predicted density map is further integrated to obtain the final bee count. During semi-supervised training, the teacher model is updated by exponential moving average, and confidence masks are used to constrain reliable regions for consistency learning. Finally, the trained model is deployed on a Jetson Orin NX-based handheld bee-counting device (NVIDIA Corporation, Santa Clara, CA, USA) for on-device inference, visualization, and real-time field counting.
2.2. Construction and Processing of the Dataset
The dataset used in this study was collected at the Science and Technology Innovation Park of Shandong Agricultural University in Shandong Province, China (117.1570° E, 36.1619° N), from June 2025 to March 2026. On-site images of bee colony activity were captured using a HUAWEI P50E smartphone (Huawei Technologies Co., Ltd., Shenzhen, China) under natural illumination. The original images had a resolution of 4096 × 3072 pixels. To reduce the effects of motion blur and unstable illumination, image acquisition was preferentially conducted during periods with stable weather and low wind speed. By adjusting the shooting angle, distance, and background conditions, bee colony images with different densities and scene characteristics were obtained, thereby enhancing the diversity and representativeness of the dataset. Manual image annotation was performed using VIA (VGG Image Annotator, version 2.0.12), and the generated CSV files recorded the image name and the x- and y-coordinates of each annotated point. The image-level ground-truth count was obtained by summing the annotated points in the corresponding CSV file. The final dataset contained 586 high-resolution images and 34,869 point annotations.
Figure 2 presents visualization examples of representative samples from the bee counting dataset. For each sample, the preprocessed image, point annotation visualization, and corresponding generated density map are shown in sequence.
To adapt the data to the density map regression task, the original images and point annotations are preprocessed in a standardized way so that image pixels and annotation coordinates remain strictly aligned. First, the CSV annotation files are parsed, the annotated points are mapped to the pixel coordinate system of the corresponding image, and invalid points (inconsistent scale, repeated annotations, or coordinates outside the image boundary) are corrected or removed. Second, the image size is normalized proportionally: the short side is scaled to no less than 1024 pixels and the long side is limited to no more than 4032 pixels, and the point coordinates are transformed together with the image, forming standardized image–annotation pairs. Finally, a geometry-adaptive Gaussian kernel density map is generated from the normalized point annotations, with the kernel width determined adaptively by the local neighbor distance. To ensure counting consistency, the generated density map undergoes mass conservation calibration so that the sum of all pixel values equals the actual number of bees in the image:
$$\sum_{(x,y)} D(x,y) = N \tag{1}$$
where $D(x,y)$ represents the value of the density map at position $(x,y)$, and $N$ denotes the actual number of bees in the image.
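As an illustration of the mass conservation property, the following minimal sketch builds a density map from point annotations and checks that its sum equals the number of points. The kernel-width rule (`beta` times the nearest-neighbor distance, with a lower bound of one pixel), the per-point normalization, and all function and parameter names are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def density_map(points, height, width, beta=0.3, default_sigma=2.0):
    """Return a height x width density map whose pixel sum equals len(points)."""
    dmap = [[0.0] * width for _ in range(height)]
    for idx, (px, py) in enumerate(points):
        # Assumed adaptive rule: kernel width follows the nearest-neighbor distance.
        dists = [math.hypot(px - qx, py - qy)
                 for j, (qx, qy) in enumerate(points) if j != idx]
        sigma = max(beta * min(dists), 1.0) if dists else default_sigma
        # Evaluate the Gaussian on the grid, then normalize it so that every
        # point contributes a mass of exactly 1, even if the kernel is
        # truncated by the image border.
        kernel = [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        mass = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / mass
    return dmap

points = [(5.0, 5.0), (8.0, 6.0), (20.0, 14.0)]  # hypothetical (x, y) annotations
dmap = density_map(points, height=24, width=32)
total = sum(map(sum, dmap))
print(round(total, 6))
```

Normalizing each kernel separately is only one way to enforce Equation (1); rescaling the whole map by the ratio of the point count to the map sum would satisfy the same constraint.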
The dataset was divided into a training set, a validation set, and a test set at a ratio of 7:2:1, resulting in 411 training images, 117 validation images, and 58 test images, which were used for model training, hyperparameter tuning, and final performance evaluation, respectively. All the labeled data ratio experiments used the same validation set and test set. Although point annotations were available for controlled evaluation, the semi-supervised settings were designed to simulate practical low-annotation scenarios, where densely distributed bees make exhaustive point annotation time-consuming and labor-intensive. Therefore, semi-supervised training was performed only within the training set. For the 10%, 30%, and 50% labeled data settings, the corresponding proportions of training images were selected as labeled samples, while the remaining training images were treated as unlabeled samples. The labeled and unlabeled subsets were mutually exclusive and together covered the entire training set.
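The split protocol above can be sketched as follows. The rounding rule (floor for the validation and test sets, remainder assigned to training), the fixed random seed, and the helper names are assumptions chosen so that 586 images reproduce the reported 411/117/58 partition; the selection procedure actually used by the authors is not specified in the text.

```python
import random

def split_dataset(image_ids, seed=0):
    """Fixed 7:2:1 split; the training set receives the rounding remainder."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_val = n * 2 // 10
    n_test = n // 10
    n_train = n - n_val - n_test
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def labeled_unlabeled(train_ids, ratio_percent, seed=0):
    """Split the training set into disjoint labeled and unlabeled subsets."""
    ids = list(train_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = round(len(ids) * ratio_percent / 100)
    return ids[:n_labeled], ids[n_labeled:]

train, val, test = split_dataset(range(586))
labeled, unlabeled = labeled_unlabeled(train, ratio_percent=10)
print(len(train), len(val), len(test))
```

Because the validation and test sets are produced once and reused, changing `ratio_percent` only changes how the training images are divided between the two subsets.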
During the training phase, the standardized high-resolution images were randomly cropped into 768 × 768 patches, and data augmentation strategies such as horizontal flipping were applied to the labeled samples. The corresponding continuous density maps were cropped in the same spatial regions as the input images and were then block-summed and downsampled according to the network output stride to generate supervised density maps at the output resolution. In the experimental setup of this study, a 768 × 768 input patch yields a 192 × 192 supervised density map, corresponding to an output stride of 4. This discretization keeps the sum of the density map unchanged, ensuring consistency between density regression supervision and global counting supervision. For unlabeled samples, weakly and strongly augmented views were constructed from the same region for semi-supervised teacher–student consistency learning. The strong augmentations include brightness, contrast, and color perturbation as well as random Gaussian blurring [33]. In the validation and testing phases, random cropping was not performed, and only the standardization processing was retained to ensure the stability and reproducibility of the evaluation results.
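The sum-preserving downsampling step can be illustrated with a small block-sum routine. This is a minimal sketch on nested lists with a hypothetical function name; a practical pipeline would typically use an array library, but the arithmetic is the same.

```python
def block_sum(dmap, stride):
    """Downsample a density map by summing each stride x stride block."""
    h, w = len(dmap), len(dmap[0])
    assert h % stride == 0 and w % stride == 0, "size must be divisible by stride"
    return [[sum(dmap[y * stride + dy][x * stride + dx]
                 for dy in range(stride) for dx in range(stride))
             for x in range(w // stride)]
            for y in range(h // stride)]

# Toy 8 x 8 map with two density peaks; stride 4 gives a 2 x 2 map.
full = [[0.0] * 8 for _ in range(8)]
full[1][2] = 0.75
full[5][6] = 1.25
small = block_sum(full, stride=4)
print(small)  # [[0.75, 0.0], [0.0, 1.25]]
```

Summing rather than averaging is what keeps the total count unchanged: with a stride of 4, a 768 × 768 density map becomes 192 × 192 while its sum stays the same.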
2.3. Design of M3DANet Network Architecture
For dense object counting in complex bee farm environments, this study proposes a semi-supervised density regression framework named M3DANet. The network structure is shown in Figure 3. The framework consists of a lightweight student network and a teacher branch based on exponential moving average (EMA): the former is used to predict density maps, while the latter provides more stable pseudo-supervision signals for unlabeled samples, thereby enhancing learning performance under limited annotation conditions.
In terms of network architecture, M3DANet first adopts the first seven stages of MobileNetV3-Large as the backbone, including inverted residual bottlenecks, depthwise separable convolutions, squeeze-and-excitation (SE) modules, and lightweight nonlinear activation functions [34]. This design reduces network depth, parameter size, and computational overhead while retaining sufficient shallow details and mid- to high-level semantic representation ability. The output of the fourth stage is used as a low-level feature map to preserve target edges, local textures, and spatial details, whereas the output of the seventh stage is used as a high-level feature map to provide stronger semantic and contextual representations. The high-level features are first mapped to 256 dimensions through a 3 × 3 convolution and are then sequentially fed into the atrous spatial pyramid pooling (ASPP) module and the multi-scale dilated convolution (MSDC) module to obtain a larger receptive field and richer multi-scale contextual information. Next, a lightweight attention module based on coordinate attention is utilized to enhance the high-level features by highlighting spatial responses related to target distribution. To improve the spatial detail restoration ability of the density map, the enhanced high-level features are upsampled to the resolution of the low-level features, concatenated with the low-level features after a 1 × 1 projection, and then fused and refined through two 3 × 3 convolution layers. Finally, a single-channel density map is output through a lightweight density regression head, and the final counting result is obtained by integrating the density map.
During training, the labeled samples are jointly optimized using the density regression loss and count constraint, whereas the unlabeled samples are constrained through density consistency and count consistency between weakly and strongly augmented views, thereby enhancing the model’s generalization ability under limited annotation conditions.
2.3.1. Feature Extractor
To balance feature representation and computational efficiency, this study uses the first seven feature blocks of MobileNetV3-Large as a lightweight feature extractor [29]. As shown in Figure 4, the feature extractor generates two-level features from the input image (3 × H × W). The first stage consists of a 3 × 3 standard convolution, batch normalization, and hard-swish activation, and performs initial downsampling with a stride of 2 to extract basic color, edge, and texture responses. The second stage uses a 3 × 3 depthwise separable residual block with a stride of 1 to refine shallow features while maintaining the spatial resolution. The third stage reduces the feature resolution to H/4 × W/4 through pointwise expansion, stride-2 depthwise convolution, and pointwise projection. The fourth stage further enhances low-level structural representations at the H/4 × W/4 resolution, and its output is selected as the low-level feature $F_l$ to preserve bee boundaries, fine-grained textures, and local spatial details. The fifth stage introduces a 5 × 5 depthwise convolution and SE attention, and reduces the feature resolution to H/8 × W/8 to capture a wider range of local context. The sixth stage further aggregates mid-level semantic responses and suppresses redundant background information at the H/8 × W/8 resolution. The seventh stage continues to use 5 × 5 depthwise convolution and SE calibration to produce the high-level feature $F_h$, which provides stronger semantic and contextual representations for subsequent density regression. After backbone extraction, the high-level feature $F_h$ is further processed by a 3 × 3 convolution followed by ReLU to obtain $F_o$ with a size of 256 × H/8 × W/8, while the low-level feature $F_l$ is retained for later fusion to enhance spatial detail recovery in density prediction.
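The resolution schedule described above can be traced with a few lines of arithmetic. The stride list below is read off the stage descriptions in this subsection; the ceiling division models a padded strided convolution, and the variable names are illustrative only.

```python
# Per-stage strides as described in the text for the seven backbone stages.
STAGE_STRIDES = [2, 1, 2, 1, 2, 1, 1]

def trace_resolutions(height, width, strides=STAGE_STRIDES):
    """Return the (height, width) of the feature map after each stage."""
    sizes = []
    for s in strides:
        # Ceil division mirrors a "same"-padded convolution with stride s.
        height, width = -(-height // s), -(-width // s)
        sizes.append((height, width))
    return sizes

sizes = trace_resolutions(768, 768)
low_level = sizes[3]   # output of stage 4, used as the low-level feature
high_level = sizes[6]  # output of stage 7, used as the high-level feature
print(low_level, high_level)  # (192, 192) (96, 96)
```

For a 768 × 768 training patch, the low-level feature therefore has the same 192 × 192 grid as the supervised density map, while the high-level feature has a 96 × 96 grid and must be upsampled by a factor of 2 before fusion.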
2.3.2. Multi-Scale Context Encoding Module
In medium- to high-density bee colony scenes, local neighborhood textures and larger-range flow patterns are equally important for counting. To this end, M3DANet applies a multi-scale context encoding (MSCE) module to the backbone output feature $F_o$. The module connects atrous spatial pyramid pooling (ASPP) and a multi-scale dilated convolution (MSDC) block in series [35,36] to explicitly model multi-scale contextual information and to aggregate fine-grained local structures. The overall structure is shown in Figure 5.
First, the ASPP module is mainly responsible for encoding global and large-scale context. As shown on the left side of Figure 5, ASPP consists of four parallel branches: one branch is a 1 × 1 convolution that preserves the original local response, and the other three are 3 × 3 dilated convolutions with dilation rates of d1 = 6, d2 = 12, and d3 = 18, respectively. The dilated branches expand the receptive field without reducing spatial resolution; at the largest rate of 18, the coverage extends from a small gathering area near the colony entrance to the overall motion pattern over a larger range. The output features of the four branches are concatenated along the channel dimension and then linearly fused through a 1 × 1 convolution to obtain a unified multi-scale contextual feature:
$$F_{\mathrm{ASPP}} = \mathrm{Conv}_{1\times1}\Big(\mathrm{Concat}\big[\mathrm{Conv}_{1\times1}(F_o),\ \mathrm{DConv}_{3\times3}^{d_1}(F_o),\ \mathrm{DConv}_{3\times3}^{d_2}(F_o),\ \mathrm{DConv}_{3\times3}^{d_3}(F_o)\big]\Big) \tag{2}$$
In Equation (2), $F_{\mathrm{ASPP}}$ represents the multi-scale contextual features output by the ASPP module, which are obtained by concatenating and fusing a 1 × 1 convolution branch and three 3 × 3 dilated convolution branches with different dilation rates.
The receptive-field coverage corresponding to different dilation rates is illustrated in Figure 6, where larger dilation rates capture the distribution patterns of bee colonies over a larger range.
On the basis of the ASPP output $F_{\mathrm{ASPP}}$, M3DANet introduces a multi-scale dilated convolution (MSDC) block to further enhance the modeling of local structures and high-density region details. Compared with ASPP, which focuses on large receptive fields, MSDC focuses on finely characterizing the local geometric shape and occlusion relationships of bee colonies within a smaller range of dilation rates [35,36]. As shown on the right side of Figure 5, MSDC applies three parallel 3 × 3 dilated convolution branches to the same input feature, with dilation rates of r1 = 1, r2 = 2, and r3 = 3, respectively. The branch outputs are then concatenated along the channel dimension and aggregated and compressed through a 1 × 1 convolution:
$$F_{\mathrm{MSDC}} = \mathrm{Conv}_{1\times1}\Big(\mathrm{Concat}\big[\mathrm{DConv}_{3\times3}^{r_1}(F_{\mathrm{ASPP}}),\ \mathrm{DConv}_{3\times3}^{r_2}(F_{\mathrm{ASPP}}),\ \mathrm{DConv}_{3\times3}^{r_3}(F_{\mathrm{ASPP}})\big]\Big) \tag{3}$$
In Equation (3), $F_{\mathrm{MSDC}}$ denotes the multi-scale local structural feature obtained after MSDC processing. It is generated by applying three parallel 3 × 3 dilated convolution branches with different dilation rates to $F_{\mathrm{ASPP}}$ for fine-grained local structure modeling, followed by channel-wise concatenation and 1 × 1 convolutional aggregation.
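To make the difference between the two sets of dilation rates concrete, the spatial extent of a single dilated convolution can be computed from the standard relation k_eff = k + (k − 1)(d − 1). The short sketch below evaluates it for the rates quoted in this subsection; the function and variable names are illustrative.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent (in feature cells) covered by one dilated convolution."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# ASPP branches use rates 6, 12, 18; MSDC branches use rates 1, 2, 3.
aspp = {d: effective_kernel(3, d) for d in (6, 12, 18)}
msdc = {r: effective_kernel(3, r) for r in (1, 2, 3)}
print(aspp)  # {6: 13, 12: 25, 18: 37}
print(msdc)  # {1: 3, 2: 5, 3: 7}
```

These extents are measured on the H/8 feature grid, so the largest ASPP branch spans 37 feature cells (296 input pixels at a stride of 8), not counting the receptive field already accumulated by the backbone, whereas the MSDC branches stay within 3 to 7 cells.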
2.3.3. Attention-Guided Low-Level Fusion
Given the MSCE feature $F_{\mathrm{MSDC}}$, the network applies an attention-guided low-level fusion (AGLF) module, composed of coordinate attention (CA) and low-level feature fusion [37,38], which aims to selectively enhance informative semantic responses while recovering fine spatial details. Figure 7 shows the structure of AGLF.
Coordinate attention first performs global average pooling along the vertical and horizontal directions, thereby explicitly encoding contextual information along the row and column directions and obtaining two direction-aware feature descriptors. After concatenation along the spatial dimension, the aggregated representation is passed through a 1 × 1 convolution, batch normalization, and ReLU for channel compression and feature interaction, and is then split into two attention maps, $A_h$ and $A_w$, corresponding to the height and width directions, respectively. Finally, coordinate attention recalibrates the input feature map through element-wise multiplication:
$$F_{\mathrm{CA}} = F_{\mathrm{MSDC}} \odot A_h \odot A_w \tag{4}$$
In Equation (4), $F_{\mathrm{CA}}$ represents the high-level semantic feature recalibrated by coordinate attention, which incorporates positional dependency information along both the horizontal and vertical directions.
This operation enhances spatially informative regions related to bee distribution while suppressing irrelevant background responses.
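The directional pooling and recalibration steps can be sketched for a single channel as follows. This is a deliberately simplified illustration: the learned 1 × 1 convolutions and batch normalization are omitted, and a sigmoid gate is applied directly to the pooled descriptors, as in the standard coordinate attention design. These simplifications and all names are assumptions for the example, not the module used in M3DANet.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def coordinate_gate(feature):
    """Recalibrate one H x W channel with row-wise and column-wise gates."""
    h, w = len(feature), len(feature[0])
    # Pool along the width: one descriptor per row (height direction).
    row_desc = [sum(row) / w for row in feature]
    # Pool along the height: one descriptor per column (width direction).
    col_desc = [sum(feature[y][x] for y in range(h)) / h for x in range(w)]
    a_h = [sigmoid(v) for v in row_desc]
    a_w = [sigmoid(v) for v in col_desc]
    # Broadcast both gates over the map and multiply element-wise.
    return [[feature[y][x] * a_h[y] * a_w[x] for x in range(w)]
            for y in range(h)]

feature = [[0.0, 0.0, 0.0],
           [4.0, 4.0, 4.0],
           [0.0, 0.0, 0.0]]
out = coordinate_gate(feature)
print(out[1][1])
```

Because each gate depends only on a row index or a column index, the attention retains positional information along one axis while aggregating context along the other.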
After coordinate attention, the enhanced high-level feature is first refined by a 3 × 3 convolution with ReLU, giving $\tilde{F}_{\mathrm{CA}}$, and is then fused with the low-level feature $F_l$ extracted from an earlier backbone stage to recover fine-grained boundary and texture information. Specifically, the high-level feature is projected by a 1 × 1 convolution followed by batch normalization and ReLU, and then upsampled to the spatial resolution of the low-level feature. Meanwhile, the low-level feature is also projected to a unified channel dimension through another 1 × 1 convolution with batch normalization and ReLU. The two feature maps are then concatenated along the channel dimension and refined by two consecutive 3 × 3 convolution layers, producing the final fused feature $F_{\mathrm{AGLF}}$:
$$F_{\mathrm{AGLF}} = \mathcal{R}\Big(\mathrm{Concat}\big[\mathrm{Up}\big(\phi_h(\tilde{F}_{\mathrm{CA}})\big),\ \phi_l(F_l)\big]\Big) \tag{5}$$
where $\phi_h$ and $\phi_l$ denote 1 × 1 convolutional projections followed by BN and ReLU, $\mathrm{Up}(\cdot)$ denotes upsampling to the resolution of the low-level feature, and $\mathcal{R}$ denotes the feature refinement function implemented by two stacked 3 × 3 convolutions. $F_{\mathrm{AGLF}}$ denotes the feature obtained after fusing low-level detailed information, which is used for the subsequent density regression.
Overall, the AGLF module can be summarized as:
$$F_{\mathrm{AGLF}} = \mathrm{Fuse}\big(\mathrm{CA}(F_{\mathrm{MSDC}}),\ F_l\big) \tag{6}$$
where CA(·) and Fuse(·) represent the coordinate attention and low-level feature fusion operators, respectively.
2.3.4. Density Regression Head and Count Estimation
Based on the AGLF feature $F_{\mathrm{AGLF}}$, M3DANet employs a lightweight convolutional regression head to generate the final density map. Specifically, the regression head consists of a 3 × 3 convolution (128 → 64), followed by a ReLU activation, and a 1 × 1 convolution (64 → 1). A final ReLU operation is applied to ensure non-negative density output. The overall mapping can be written as:
$$\hat{D} = \mathrm{ReLU}\big(\mathcal{H}(F_{\mathrm{AGLF}})\big) \tag{7}$$
where $\mathcal{H}$ denotes the regression head composed of two convolutional layers with an intermediate ReLU activation. Since the proposed network adopts low-level feature fusion, the predicted density map is generated at a spatial resolution of H/4 × W/4 under the current setting.
After obtaining the predicted density map $\hat{D}$, the estimated number of bees in the entire image can be obtained by summing over all pixel positions:
$$\hat{C} = \sum_{i,j} \hat{D}(i,j) \tag{8}$$
where $\hat{C}$ denotes the predicted count. Since the Gaussian density map used during training satisfies the mass conservation constraint, i.e., the integral of the density map is equal to the number of annotated points, the above summation remains naturally consistent with manual point-based counting during inference. At the same time, this operation introduces very little computational overhead, making it suitable for real-time deployment on embedded platforms.
2.4. Semi-Supervised Learning Strategy and Loss Functions
In this work, a teacher–student framework is adopted for semi-supervised learning. The student network is trained on a limited number of labeled samples and further constrained by consistency learning on unlabeled samples, whereas the teacher network is updated via exponential moving average and provides more stable pseudo-supervision signals for unlabeled data. Within this framework, the supervised component is used to optimize density map regression and overall counting accuracy, while the consistency component helps the model learn more stable density representations from unlabeled samples.
2.4.1. Supervised Loss of Labeled Samples
For labeled samples, the model is jointly optimized by a pixel-level density regression loss and an image-level count loss. Let a batch contain B labeled samples. For the b-th sample, the ground-truth density map is denoted by $D_b$, and the corresponding ground-truth bee count is denoted by $C_b$. The predicted density map is denoted by $\hat{D}_b$. The predicted count can then be computed by summing the values over all spatial locations of the predicted density map, as follows:
$$\hat{C}_b = \sum_{i,j} \hat{D}_b(i,j) \tag{9}$$
The ground-truth density map $D_b$ is generated from the manual point annotations described in Section 2.2 using an adaptive Gaussian kernel and is further calibrated by mass conservation. The ground-truth count $C_b$ is the total number of valid manually annotated points in the corresponding image, which is also equal to the integral or discrete summation of $D_b$ over the whole image.
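The density regression term is defined as the mean squared error between the predicted and ground-truth density maps:
$$\mathcal{L}_{\mathrm{den}} = \frac{1}{B\,H'W'} \sum_{b=1}^{B} \sum_{i,j} \big(\hat{D}_b(i,j) - D_b(i,j)\big)^2 \tag{10}$$
where $H' \times W'$ is the spatial size of the density map.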
To further constrain the global counting accuracy, a normalized count loss is introduced:
where
is a small constant for numerical stability, and
denotes the weight assigned to high-count samples, defined as
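Accordingly, the supervised loss for labeled samples is formulated as
$$\mathcal{L}_{\mathrm{sup}} = \lambda_{\mathrm{den}}\,\mathcal{L}_{\mathrm{den}} + \lambda_{\mathrm{cnt}}\,\mathcal{L}_{\mathrm{cnt}} \tag{13}$$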
In Equation (13), and denote the weighting coefficients of the density regression loss and the count loss, respectively. The density regression loss and the count loss have different numerical scales because the former is computed from pixel-level density map regression, whereas the latter is computed from image-level count errors. Therefore, and are introduced to balance the magnitudes of the two loss terms during optimization. In the final experimental setting, = 20, = 0.01, = 1.0, = 0.1, and = 4.0. This loss formulation takes density regression as the main optimization objective, while introducing count supervision to improve global counting consistency. In addition, the normalized count error and the mild reweighting of high-count samples help stabilize the training process across scenes with different crowding levels.
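The structure of the supervised objective can be sketched as follows. The exact forms of the normalized count loss and of the high-count sample weights are not reproduced here: the sketch assumes an absolute count error divided by the ground-truth count plus a stability constant, and uniform sample weights unless others are supplied. All names and the toy values are assumptions for illustration.

```python
def supervised_loss(pred_maps, gt_maps, lambda_den, lambda_cnt,
                    eps=1.0, sample_weights=None):
    """Weighted sum of a pixel-level MSE term and a normalized count term."""
    batch = len(pred_maps)
    weights = sample_weights or [1.0] * batch
    den_term, cnt_term = 0.0, 0.0
    for pred, gt, w in zip(pred_maps, gt_maps, weights):
        flat_p = [v for row in pred for v in row]
        flat_g = [v for row in gt for v in row]
        # Pixel-level mean squared error between density maps.
        den_term += sum((p - g) ** 2 for p, g in zip(flat_p, flat_g)) / len(flat_p)
        # Assumed count term: absolute error normalized by the true count.
        c_pred, c_gt = sum(flat_p), sum(flat_g)
        cnt_term += w * abs(c_pred - c_gt) / (c_gt + eps)
    return lambda_den * den_term / batch + lambda_cnt * cnt_term / batch

gt = [[[1.0, 0.0], [0.0, 0.0]]]
pred = [[[2.0, 0.0], [0.0, 0.0]]]
loss = supervised_loss(pred, gt, lambda_den=1.0, lambda_cnt=1.0)
print(loss)  # 0.25 (density) + 0.5 (count) = 0.75
```

Normalizing the count error by the true count keeps sparse and crowded images on a comparable scale, which is why the two weighting coefficients are needed to balance the pixel-level and image-level terms.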
2.4.2. Consistency Loss of Unlabeled Samples
When labeled data are limited, relying on them alone can lead to overfitting. To fully exploit the information in unlabeled images, a teacher–student framework was constructed: the student network S is updated by gradient descent, while the teacher network shares the same architecture and has its parameters updated from the student parameters through an exponential moving average (EMA) [19]:
$$\theta_T \leftarrow \beta\,\theta_T + (1-\beta)\,\theta_S$$
where $\theta_S$ and $\theta_T$ are the parameters of the student and teacher networks, respectively, and $\beta$ is the EMA decay coefficient, which is set to 0.999 in this work.
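The EMA update can be sketched in a few lines. This is an illustrative sketch that stores parameters as a dictionary of NumPy arrays; the function name `ema_update` and the parameter names are hypothetical, and a real training loop would apply the same rule to the framework's tensors after each student step.

```python
import numpy as np

def ema_update(teacher, student, beta=0.999):
    """Move each teacher parameter towards the matching student parameter."""
    for name, s_param in student.items():
        teacher[name] = beta * teacher[name] + (1.0 - beta) * s_param
    return teacher

# Toy parameters; beta=0.9 is used here only to make the effect visible.
student = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
teacher = {"w": np.array([0.0, 0.0]), "b": np.array([0.0])}
ema_update(teacher, student, beta=0.9)
```

With the reported decay of 0.999, each step moves the teacher only 0.1% of the way towards the student, so the teacher behaves like a slowly varying average of recent student states.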
For each unlabeled sample, the teacher network receives a weakly augmented view, while the student network receives a strongly augmented view. Let the corresponding density predictions be denoted by $\hat{D}^{T}$ and $\hat{D}^{S}$, respectively. A confidence mask is constructed based on the teacher prediction, and consistency is imposed only on high-confidence regions:
$$M(x,y) = \mathbb{1}\left[\hat{D}^{T}(x,y) \ge \tau\right]$$
where $\tau$ is adaptively determined from the sample-wise quantile of the teacher-predicted density map. In this work, the quantile parameter is set to q = 0.90, meaning that only the high-response regions of the teacher prediction are used for consistency learning. To avoid excessively sparse supervision, the minimum valid-mask ratio is further set to 0.02.
Based on this mask, the unlabeled density consistency loss is defined as a masked mean squared error:
$$\mathcal{L}_{u}^{den} = \frac{\sum_{x,y} M(x,y)\left(\hat{D}^{S}(x,y) - \hat{D}^{T}(x,y)\right)^{2}}{\sum_{x,y} M(x,y) + \epsilon}$$
where $\epsilon$ is a small constant introduced to avoid division by zero.
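The confidence mask and masked mean squared error can be sketched as follows, assuming NumPy. The quantile of 0.90 and the minimum valid-mask ratio of 0.02 come from the text; how the minimum ratio is enforced is not specified, so the fallback below (lowering the threshold until the top 2% of pixels are kept) is an assumption, as are the function names and the value of `eps`.

```python
import numpy as np

def confidence_mask(teacher_density, q=0.90, min_ratio=0.02):
    """Keep pixels at or above the per-sample q-quantile of the teacher map."""
    tau = np.quantile(teacher_density, q)
    mask = teacher_density >= tau
    if mask.mean() < min_ratio:
        # Assumed fallback: relax the threshold so that roughly the top
        # min_ratio fraction of pixels remains valid.
        tau = np.quantile(teacher_density, 1.0 - min_ratio)
        mask = teacher_density >= tau
    return mask.astype(float)

def masked_mse(student_density, teacher_density, mask, eps=1e-6):
    """Squared error averaged over the valid (mask == 1) pixels only."""
    sq_err = (student_density - teacher_density) ** 2
    return float((mask * sq_err).sum() / (mask.sum() + eps))

rng = np.random.default_rng(0)
teacher_map = rng.random((100, 100))   # stand-in for a teacher prediction
student_map = teacher_map + 0.1        # stand-in for a student prediction
mask = confidence_mask(teacher_map)
loss = masked_mse(student_map, teacher_map, mask)
```

Normalizing by the number of valid pixels rather than by the image size keeps the loss magnitude comparable across samples with different mask coverage.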
In addition to density map consistency, a count consistency constraint is also introduced. Let the teacher and student predicted counts be
Then, the count consistency loss for unlabeled samples is defined as
Thus, the final consistency loss for unlabeled samples is written as
2.4.3. Total Loss
Combining labeled and unlabeled data, the total loss is
$$\mathcal{L} = \mathcal{L}_{sup} + \alpha(e)\,\mathcal{L}_{cons}$$
where $\mathcal{L}_{sup}$ is the supervised loss for labeled samples, $\mathcal{L}_{cons}$ is the consistency loss for unlabeled samples, and α(e) is the semi-supervised weight, which varies with the epoch e. A linear warm-up strategy is adopted in the implementation [20]:
$$\alpha(e) = \alpha_{\max}\cdot\min\left(1, \frac{e}{E_w}\right)$$
In the final training setting, $\alpha_{\max}$ = 0.01 and $E_w$ = 40. These settings were further examined through sensitivity analysis in Section 3.5. This strategy allows the model to rely more on labeled supervision during the early stage of training and gradually increase the contribution of unlabeled consistency as training proceeds, so that pseudo-label information is introduced more stably after the teacher–student predictions become sufficiently reliable.
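A linear warm-up of the semi-supervised weight can be sketched as below. The maximum weight of 0.01 and the 40-epoch warm-up are the values reported in this paper; the exact ramp (starting from zero at epoch 0 and holding the maximum afterwards) and the function names are assumptions made for illustration.

```python
def consistency_weight(epoch, max_weight=0.01, warmup_epochs=40):
    """Ramp linearly from 0 to max_weight over warmup_epochs, then hold."""
    if warmup_epochs <= 0:
        return max_weight
    return max_weight * min(1.0, epoch / warmup_epochs)

def total_loss(sup_loss, cons_loss, epoch):
    """Supervised loss plus the epoch-dependent consistency contribution."""
    return sup_loss + consistency_weight(epoch) * cons_loss

# Weight at a few epochs of a 200-epoch run.
schedule = {e: consistency_weight(e) for e in (0, 10, 20, 40, 200)}
```

Under this schedule the unlabeled term contributes nothing at the first epoch, half of its final weight at epoch 20, and its full weight from epoch 40 onward.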
2.5. Baseline Methods and Implementation Details
To provide a fair and comprehensive comparison, representative baseline methods were selected from both semi-supervised counting and density regression counting models. For semi-supervised comparison, Dream [39], Calibrating [21], MTCP [40], and MRC [24] were used to evaluate counting performance under different labeled data ratios. These methods are closely related to semi-supervised density-map-based counting and are designed to reduce annotation dependency by exploiting unlabeled samples through pseudo-labeling, consistency regularization, uncertainty calibration, contextual modeling, or teacher–student learning strategies.
Dream is a semi-supervised crowd-counting method based on rank-consistent pyramid learning. It exploits the ranking relationship among density representations at different pyramid levels to improve the use of unlabeled data and enhance density map regression under limited annotations. Calibrating is an uncertainty-aware semi-supervised counting method that improves pseudo-label reliability by calibrating uncertainty estimation, thereby reducing the negative influence of noisy pseudo-supervision. MTCP, namely multi-task credible pseudo-label learning, introduces credible pseudo-label selection and multi-task learning into semi-supervised crowd counting. By improving the reliability of pseudo-labels and jointly optimizing related supervision signals, MTCP enhances counting robustness when only limited labeled samples are available. MRC is a semi-supervised crowd-counting method with contextual modeling. It aims to facilitate a more holistic understanding of dense scenes by modeling contextual information, thereby improving the model’s ability to learn from unlabeled samples in complex counting scenarios.
For lightweight and deployment-oriented comparison, MCNN [15], TasselNetV2+ [41], and CSRNet [16] were selected. MCNN is a classical multi-column convolutional network for density-map-based counting. TasselNetV2+ is a compact regression-based counting model originally designed for plant counting tasks. CSRNet is a strong density regression baseline that uses dilated convolutions to enlarge the receptive field and has been widely used in dense object counting. These models were included to evaluate the accuracy–efficiency trade-off of M3DANet in terms of counting accuracy, parameter size, and inference speed.
All baseline models were evaluated on the same bee counting dataset using identical training, validation, and test splits. In the semi-supervised experiments, the validation and test sets were kept fixed across all labeled data ratios, while only the proportion of labeled samples in the training set was changed. The remaining training images were treated as unlabeled data. The same evaluation metrics, namely MAE and RMSE, were used for all methods.
3. Results
3.1. Experimental Setup
Most density-map-based counting studies adopt the mean absolute error (MAE) and root mean square error (RMSE) as standard evaluation metrics [
42]. MAE measures the average absolute deviation between the predicted count and the ground truth, whereas RMSE penalizes large errors more strongly and is therefore more sensitive to outliers. In this work, MAE and RMSE are employed to evaluate the performance of the proposed semi-supervised bee colony counting model.
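For a test set containing $N$ images, let $\hat{C}_i$ and $C_i$ denote the predicted and ground-truth numbers of bees in the i-th image, respectively. The MAE and RMSE are calculated as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{C}_i - C_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{C}_i - C_i\right)^{2}}$$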
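As a small illustration of these two metrics, the sketch below computes them from lists of image-level counts using only the Python standard library. The counts are made-up values chosen to show that RMSE weights the larger error more heavily than MAE does.

```python
import math

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root mean square error between predicted and ground-truth counts."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

# Illustrative counts only: per-image errors are -2, 0, and +3.
pred = [10.0, 20.0, 33.0]
gt = [12.0, 20.0, 30.0]
print(f"MAE = {mae(pred, gt):.3f}, RMSE = {rmse(pred, gt):.3f}")
```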
All experiments were conducted in the same environment: Ubuntu 20.04, PyTorch 1.11.0, CUDA 11.3, and Python 3.8. The hardware platform included an NVIDIA RTX 4090D GPU (24 GB; NVIDIA Corporation, Santa Clara, CA, USA), an Intel Xeon Platinum 8474C CPU (15 vCPUs; Intel Corporation, Santa Clara, CA, USA), and 80 GB RAM. The model was trained for 200 epochs with a batch size of 8 and a learning rate of 1.0 × 10⁻⁵. AdamW was used as the optimizer with a weight decay of 1 × 10⁻⁴. Semi-supervised training followed a teacher–student framework, in which the teacher parameters were updated from the student parameters using EMA with a decay factor of 0.999. The consistency loss included a density consistency term and a count consistency term, each scaled by its own weighting coefficient. A 40-epoch warm-up strategy was adopted to gradually introduce the semi-supervised loss. These settings were used in the main experiments, and their influence was further evaluated in the sensitivity analysis.
All the comparative experiments followed a unified pipeline: input images were first resized proportionally and then randomly cropped to 768 × 768 during training; during validation and testing, sliding-window inference was performed with a 768 × 768 window and a stride of 384. Standard data augmentation, Gaussian density map supervision, AdamW weight decay, warm-up training, confidence masking, and fixed validation and test splits were used to improve training stability and reduce the risk of overfitting under limited labeled data.
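The sliding-window inference step can be sketched as follows, assuming NumPy. The 768 × 768 window and stride of 384 are taken from the text; how overlapping predictions are merged is not stated, so averaging them is an assumption, as is the simplification that the predictor returns a density map at the same resolution as its input patch. The names `window_starts`, `sliding_window_density`, and `predict` are hypothetical.

```python
import numpy as np

def window_starts(size, window, stride):
    """Start offsets covering [0, size), with the last window flush to the edge."""
    if size <= window:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts

def sliding_window_density(image, predict, window=768, stride=384):
    """Run predict on overlapping windows and average the overlapping outputs."""
    h, w = image.shape[:2]
    acc = np.zeros((h, w), dtype=float)    # summed predictions
    hits = np.zeros((h, w), dtype=float)   # number of windows per pixel
    for y in window_starts(h, window, stride):
        for x in window_starts(w, window, stride):
            patch = image[y:y + window, x:x + window]
            acc[y:y + window, x:x + window] += predict(patch)
            hits[y:y + window, x:x + window] += 1.0
    return acc / hits

# Toy check with an identity "model": the stitched map reproduces the input.
rng = np.random.default_rng(0)
image = rng.random((1000, 1500))
stitched = sliding_window_density(image, predict=lambda patch: patch)
predicted_count = stitched.sum()
```

Averaging, rather than summing, the overlapping outputs keeps the integral of the stitched map equal to a single count estimate, so pixels covered by several windows are not counted more than once.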
3.2. Comparison of Counting Performance
As shown in
Table 1, M3DANet achieved an MAE of 5.201 and an RMSE of 6.989 under the fully supervised setting. Compared with MCNN and TasselNetV2+, M3DANet substantially reduced both the MAE and RMSE, showing stronger robustness in dense bee counting scenes. Compared with the strong density regression baseline CSRNet, M3DANet obtained a slightly lower MAE (5.201 vs. 5.298), while CSRNet achieved a marginally lower RMSE (6.856 vs. 6.989). Therefore, the main advantage of M3DANet in the fully supervised setting is not a uniform improvement in every error metric, but a stronger accuracy–efficiency balance: it used only 2.095 M parameters, reduced the parameter size by approximately 87.1% relative to CSRNet, and achieved a 17.7-fold higher inference speed.
Under the semi-supervised setting, M3DANet consistently obtained the lowest MAE and RMSE across the 10%, 30%, and 50% labeled data ratios. With only 10% labeled training images, M3DANet achieved an MAE of 9.937 and an RMSE of 13.093, outperforming Dream, Calibrating, MTCP, and MRC. At the 30% labeled data ratio, it further reduced the MAE to 7.003 and the RMSE to 9.387, remaining better than the closest baseline, MTCP. At the 50% labeled data ratio, M3DANet achieved the best semi-supervised result, with an MAE of 5.570 and an RMSE of 7.620. Compared with the strongest baseline at this ratio, MRC, the MAE and RMSE were reduced by 20.32% and 14.38%, respectively. These results indicate that the proposed teacher–student consistency strategy can effectively exploit unlabeled bee images and maintain stable gains under different annotation budgets.
From the annotation efficiency perspective, the MAE of M3DANet decreased from 9.937 to 7.003 and 5.570 as the labeled data ratio increased from 10% to 30% and 50%, and RMSE decreased from 13.093 to 9.387 and 7.620. The 50% semi-supervised setting was close to the fully supervised result, with only 0.369 higher MAE and 0.631 higher RMSE. Notably, using only 50% labeled training images, M3DANet already achieved substantially lower counting errors than the fully supervised MCNN and TasselNetV2+, and approached the performance of the strong fully supervised density regression baseline CSRNet. This suggests that M3DANet can use unlabeled data to recover much of the performance that would otherwise require additional manual point annotations.
Figure 8 further shows that the predicted-count curve becomes progressively closer to the ground-truth curve as the labeled data ratio increases. Although the 10% setting already captures the overall count trend, larger deviations remain in high-count images. These deviations are reduced at 30% and further narrowed at 50%, confirming that the semi-supervised framework benefits from additional labeled samples while still preserving strong performance under limited annotation.
As shown in Figure 9, for a sample with a ground-truth (GT) count of 135, M3DANet produces more complete responses in bee-dense regions, reduces missed responses, and yields a predicted count closer to the ground truth than the other semi-supervised methods.
3.3. Ablation Experiments
3.3.1. Structural Ablation Experiment
The structural ablation results are shown in
Table 2. BackboneOnly, consisting of the MobileNetV3-Large backbone and a lightweight density regression head, was used as the baseline. On this basis, the multi-scale context encoding module (MSCE), the attention-guided low-level fusion module (AGLF), and their combination were introduced to evaluate their individual and joint effects on counting performance.
BackboneOnly achieved an MAE of 7.064 and an RMSE of 9.194. After introducing MSCE, the MAE and RMSE decreased to 6.455 and 8.761, corresponding to relative reductions of 8.62% and 4.71%, respectively. This indicates that multi-scale contextual encoding improves the representation of bee targets with different apparent sizes and enhances density regression in complex scenes.
When only AGLF was added to BackboneOnly, the MAE and RMSE decreased to 5.922 and 7.501, corresponding to relative reductions of 16.17% and 18.41% compared with BackboneOnly. This result suggests that coordinate attention and low-level feature fusion are effective for preserving boundary cues, local textures, and spatial location information, thereby improving detail recovery in dense regions and reducing background interference.
When MSCE and AGLF were jointly enabled, the full M3DANet achieved the best performance, with an MAE of 5.201 and an RMSE of 6.989. Compared with BackboneOnly, the full model reduced MAE and RMSE by 26.37% and 23.98%, respectively. The paired Wilcoxon signed-rank tests showed that the full model significantly outperformed all three reduced structural variants (p < 0.05), confirming that MSCE and AGLF are complementary in multi-scale contextual modeling and low-level spatial detail restoration.
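The relative reductions quoted in this subsection follow directly from the reported MAE and RMSE values. The short check below recomputes them; the dictionary keys such as `+MSCE` are display labels chosen here, not names from the paper's tables.

```python
def relative_reduction(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline

backbone = {"MAE": 7.064, "RMSE": 9.194}
variants = {
    "+MSCE": {"MAE": 6.455, "RMSE": 8.761},
    "+AGLF": {"MAE": 5.922, "RMSE": 7.501},
    "Full": {"MAE": 5.201, "RMSE": 6.989},
}

reductions = {
    name: {k: round(relative_reduction(backbone[k], v), 2) for k, v in m.items()}
    for name, m in variants.items()
}
for name, r in reductions.items():
    print(name, r)
```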
3.3.2. Semi-Supervised Component Ablation Experiment
To further verify the effectiveness of the semi-supervised learning strategy, an additional component ablation experiment was conducted under the 50% labeled data setting. The complete framework contains density consistency, count consistency, confidence masking, and the warm-up strategy. The results are reported in
Table 3.
The full semi-supervised framework achieved the best result, with an MAE of 5.570 and an RMSE of 7.620. When the consistency constraints were removed, the model degenerated into a supervised-only setting using labeled samples only, and the MAE and RMSE increased to 6.603 and 8.912, respectively. Compared with this setting, the full semi-supervised framework reduced the MAE and RMSE by 15.64% and 14.50%, respectively, indicating that unlabeled samples provide useful supplementary supervision through teacher–student consistency learning.
The single-consistency variants were also inferior to the full framework. The Count-only variant obtained an MAE of 6.801 and an RMSE of 9.223, whereas the Density-only variant obtained an MAE of 7.192 and an RMSE of 9.710. These results demonstrate the complementarity between pixel-level density consistency and image-level count consistency. Density consistency constrains the spatial distribution of the predicted density map, while count consistency constrains the global counting result.
Removing the confidence mask increased the MAE and RMSE to 7.004 and 9.426, respectively, showing that directly using all pseudo-density responses may introduce noise from low-confidence regions. Removing the warm-up strategy also increased the MAE and RMSE to 6.994 and 9.413, respectively, suggesting that gradually introducing the semi-supervised loss during early training prevents unstable teacher predictions from imposing excessive incorrect supervision. All reduced semi-supervised variants differed significantly from the full SSL framework in paired Wilcoxon tests (p < 0.05).
Figure 10 summarizes the ablation results. The structural ablation results show that the model error decreases progressively with the introduction of MSCE and AGLF, while the semi-supervised component ablation shows that removing any key component increases the error. These findings verify the effectiveness of both the proposed network structure and the semi-supervised training strategy.
3.4. Generalization Experiment
An independent self-built dataset was used to evaluate the generalization ability of the model across honeybee species. The main dataset consists mainly of images of Italian bees, whereas the generalization dataset consists of images of Chinese bees; the differences between the two species may lead to domain shift. This dataset therefore provides a suitable test case for the robustness of cross-species counting. Its annotation and preprocessing procedures are consistent with those of the main dataset, and it contains a total of 504 images and 133,888 annotation points. The experimental configuration for generalization testing is the same as that used in the main experiment; therefore, the network inputs, training strategies, inference procedure, and evaluation metrics are not repeated here.
Figure 11 shows an example from the generalization dataset: the preprocessed image, the point annotation visualization, and the corresponding generated density map are displayed in sequence.
As shown in
Table 4, Bias denotes the mean signed prediction error and is used to identify whether a model tends to systematically overestimate or underestimate image-level counts. Under the 10% labeled setting, M3DANet achieved the lowest MAE and RMSE, which were 13.947 and 17.649, respectively. Compared with Dream, Calibrating, MTCP, and MRC, the MAE of M3DANet was reduced by 49.7%, 60.3%, 44.8%, and 14.7%, respectively. This indicates that M3DANet can use limited supervision more effectively and maintain stronger cross-species generalization on Chinese honeybee images.
Under the 30% labeled setting, M3DANet achieved an MAE of 11.945 and an RMSE of 14.078, showing the best performance among all semi-supervised methods at this labeled ratio. Compared with Dream, Calibrating, MTCP, and MRC, the MAE of M3DANet was reduced by 34.8%, 44.4%, 36.2%, and 8.8%, respectively. Under the 50% labeled setting, the MAE and RMSE of M3DANet were 11.772 and 15.893, respectively, remaining lower than those of Dream, Calibrating, and MTCP. MRC achieved an MAE of 11.300 and an RMSE of 14.237, which were numerically lower than those of M3DANet.
Under the fully supervised setting, the MAE and RMSE of M3DANet were 11.317 and 14.390, respectively, outperforming MCNN and TasselNetV2+ and approaching CSRNet. The Bias values show that Dream, Calibrating, and MTCP had clear negative bias on the Chinese honeybee dataset, especially under the 10% labeled setting, where their Bias values were −22.756, −24.231, and −24.998, respectively. In contrast, the Bias values of M3DANet under the 10% and 50% labeled settings were −5.086 and −0.859, respectively, indicating that M3DANet alleviated systematic underestimation in the cross-species generalization scenario.
Figure 12 further shows that M3DANet has a lower and more compact per-image error distribution than Dream, Calibrating, and MTCP, especially under the 10% and 30% labeled settings.
To further determine whether the observed differences were statistically reliable, paired significance analysis was performed based on image-level predictions. For each test image, the absolute error of M3DANet was compared with that of each baseline model, and the paired improvement was calculated as the baseline absolute error minus the M3DANet absolute error. Therefore, a positive paired improvement indicates that M3DANet produced a lower absolute error than the corresponding baseline on the same image. Because image-level errors may not follow a normal distribution, the Wilcoxon signed-rank test was used for paired testing, and the Holm–Bonferroni method was applied for multiple-comparison correction.
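The paired-improvement computation and the Holm–Bonferroni correction can be sketched with the standard library alone. The Wilcoxon p-values themselves would come from a statistics package (for example `scipy.stats.wilcoxon` applied to the paired absolute errors) and are represented here by placeholder numbers; the function names are hypothetical.

```python
def paired_improvement(baseline_pred, model_pred, gt):
    """Per-image baseline |error| minus model |error|; positive favours the model."""
    return [abs(b - g) - abs(m - g)
            for b, m, g in zip(baseline_pred, model_pred, gt)]

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Placeholder inputs for illustration only.
improvement = paired_improvement([10, 8], [9, 8], [9, 9])
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A comparison is then declared significant when its adjusted p-value falls below the chosen level, which controls the family-wise error rate across all baselines tested at one labeled ratio.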
Figure 13 indicates that M3DANet significantly outperforms Dream, Calibrating, and MTCP under the 10% and 30% labeled settings after Holm correction. The paired difference between M3DANet and MRC is not significant at the 30% and 50% labeled settings, although M3DANet has a lower mean error at 30% and MRC is slightly lower at 50%.
Overall, the Chinese honeybee generalization experiment demonstrates that M3DANet has strong low-label cross-species generalization ability. In particular, M3DANet achieved the lowest MAE and RMSE among all semi-supervised methods under the 10% and 30% labeled settings. The paired tests further show significant improvements over Dream, Calibrating, and MTCP at these two labeled ratios. The differences between M3DANet and MRC under the 30% and 50% labeled settings were not statistically significant, indicating comparable performance when more labeled data are available. These results suggest that M3DANet is not only applicable to Italian honeybee images in the main dataset but can also transfer to Chinese honeybee images while maintaining stable counting performance under low-label cross-species conditions.
3.5. Hyperparameter Sensitivity and Stability Analysis
To assess the stability of the key hyperparameter settings and justify their choice, a sensitivity analysis was conducted for three parameters that directly affect the semi-supervised learning process: the semi-supervised consistency-loss weight, the exponential moving average (EMA) decay coefficient β, and the confidence-mask quantile q. All experiments were performed under the 50% labeled data setting. A one-factor-at-a-time strategy was adopted: unless otherwise specified, the reference configuration was fixed as a consistency-loss weight of 0.01, β = 0.999, and q = 0.90, and only the target hyperparameter was varied in each group. The remaining network architecture, data split, preprocessing strategy, and training protocol were kept unchanged. The results are summarized in Table 5.
As shown in Table 5, the repeated reference configuration obtained identical results in the three hyperparameter groups, with an MAE of 5.570 and an RMSE of 7.620, confirming that the experiments followed a controlled single-variable design. For the semi-supervised consistency-loss weight, the best performance was achieved at a value of 0.01. A smaller value weakened the use of unlabeled samples, whereas a larger value may introduce pseudo-label noise or over-constrain the student network.
For the EMA decay coefficient β, the lowest MAE and RMSE were obtained when β = 0.999. A smaller value caused the teacher model to update too rapidly and reduced pseudo-supervision stability, while a larger value made the teacher model adapt too slowly to the student network. Therefore, β = 0.999 provided a suitable balance between temporal smoothing and update responsiveness.
For the confidence-mask quantile q, the best result was obtained at q = 0.90. A lower threshold may include low-confidence background responses, whereas a higher threshold may discard useful high-response regions. Overall, the sensitivity analysis supports the final settings of a consistency-loss weight of 0.01, β = 0.999, and q = 0.90, demonstrating the empirical rationality and robustness of the proposed semi-supervised training strategy.
3.6. Deployment Verification
M3DANet effectively reduced model complexity while maintaining high counting accuracy, making it suitable for deployment in resource-constrained real-world bee monitoring scenarios. To further evaluate its deployment feasibility, the trained model was deployed on a self-developed handheld bee-counting device based on the NVIDIA Jetson Orin NX platform (NVIDIA Corporation, Santa Clara, CA, USA). As shown in Figure 14, the device was tested in a real beekeeping environment under natural lighting conditions. Figure 14A,B show the collection process for the edge-area samples and the whole-frame samples, respectively; Figure 14C shows close-range acquisition of a bee cluster region, and Figure 14D shows the on-site collection process under handheld operation. The main deployment configuration and edge device benchmark results are shown in Table 6. The deployed M3DANet model contains only 2.095 million parameters. On the Jetson Orin NX platform, the average inference latency was 65.75 ms/image, and the throughput of the complete processing flow was 10.44 FPS. The average process memory usage was 943.51 MB, further demonstrating the lightweight nature of M3DANet and its ability to run on portable edge hardware with moderate computational and memory overhead.
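The relationship between the reported latency and throughput figures can be made explicit with a short calculation. The sketch assumes that images are processed one at a time with no pipelining between stages, so that the gap between the end-to-end time and the inference latency can be read as pre- and post-processing overhead; that interpretation is an assumption, not a measurement from the paper.

```python
latency_ms = 65.75      # reported mean inference latency per image
pipeline_fps = 10.44    # reported throughput of the complete processing flow

inference_only_fps = 1000.0 / latency_ms   # rate if inference were the only cost
pipeline_ms = 1000.0 / pipeline_fps        # end-to-end time per image
overhead_ms = pipeline_ms - latency_ms     # assumed non-inference time per image

print(f"inference-only rate: {inference_only_fps:.1f} FPS")
print(f"end-to-end time:     {pipeline_ms:.1f} ms/image")
print(f"implied overhead:    {overhead_ms:.1f} ms/image")
```

Under this reading, roughly one third of the end-to-end time per image is spent outside the network itself, for example on image capture, resizing, and window stitching.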
Figure 15 presents two representative visualization examples of field counting results under real-world variations. The first row shows a close-range bee cluster sample, and the second row shows a whole-frame bee colony sample with a more complex hive background. In both cases, the high-response regions of the predicted density maps are generally consistent with the actual bee cluster distributions, indicating that the model can effectively focus on bee-populated areas while suppressing background interference. The predicted counts for the two cases are 262 and 211, respectively, which further supports the real-time counting capability and deployment feasibility of M3DANet in practical beekeeping scenarios.
3.7. Failure Case Analysis
To clarify the robustness boundary of the proposed method, a failure case analysis was conducted on the fixed test set, with an SSL loss weight of 0.01 and 50% of the training images labeled. Since M3DANet is a density regression model rather than an instance detector, image-level under-counting and over-counting were used to approximate missed-detection and false-positive-like tendencies, respectively. On the 58 test images, the model achieved an MAE of 5.570 and an RMSE of 7.620, with under-counting cases being more frequent than over-counting cases, indicating that missed detections were the dominant error pattern. As shown in Figure 16, under-counting mainly occurs in high-density bee clusters, where adjacent and partially occluded bees generate merged or weakened density responses, resulting in a lower predicted count than the ground truth. In contrast, over-counting tends to appear in sparse scenes, where honeycomb texture and visually similar local background patterns may introduce weak false-positive-like density responses. These observations indicate that M3DANet performs reliably on the fixed test set of the main dataset, but its remaining errors are mainly associated with dense occlusion and appearance ambiguity. Future work will therefore focus on robustness-oriented augmentation and uncertainty-aware pseudo-label filtering to further reduce missed counts in highly crowded field scenes.
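The image-level split into under-counting and over-counting cases can be sketched as below. The tolerance `tol`, which treats errors within half a bee as exact, is a hypothetical choice, and the counts are made-up values rather than test-set results.

```python
def error_breakdown(pred, gt, tol=0.5):
    """Count under-, over-, and near-exact predictions from signed errors."""
    signed = [p - g for p, g in zip(pred, gt)]
    under = sum(e < -tol for e in signed)
    over = sum(e > tol for e in signed)
    return {
        "under": under,
        "over": over,
        "exact": len(signed) - under - over,
        "mean_signed_error": sum(signed) / len(signed),
    }

# Illustrative counts only: signed errors are -6.8, +0.3, +2.0, and -5.0.
pred = [98.2, 40.3, 12.0, 130.0]
gt = [105.0, 40.0, 10.0, 135.0]
summary = error_breakdown(pred, gt)
```

A negative mean signed error together with more under-counted than over-counted images is the pattern that would indicate missed bees as the dominant failure mode.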
4. Discussion
4.1. Overall Counting Performance: Synergy of Architecture and Semi-Supervised Strategy
Under the fully supervised setting (
Table 1), M3DANet achieved an MAE of 5.201 and an RMSE of 6.989. Compared with the lightweight models MCNN (24.899 and 35.771) and TasselNetV2+ (9.593 and 12.787), M3DANet substantially reduced the counting errors, indicating that the multi-scale context encoding (MSCE) and attention-guided low-level fusion (AGLF) modules can effectively handle scale variation and background clutter in bee images. Compared with the strong baseline CSRNet (5.298 and 6.856), M3DANet achieved a slightly lower MAE (−1.8%) and a slightly higher RMSE (+1.9%), while using only 12.9% of the parameters (2.095 M vs. 16.26 M) and achieving a 17.7 times higher inference speed (416.6 FPS vs. 23.6 FPS). Therefore, the main advantage of M3DANet is not uniform superiority over heavier models, but a stronger accuracy–efficiency trade-off.
Under the semi-supervised settings with 10%, 30%, and 50% labeled data ratios, M3DANet consistently achieved the lowest MAE and RMSE at all ratios (Table 1). For example, at the 10% labeled data ratio, the MAE of M3DANet was 9.937, which was 14.3% lower than that of Dream. At the 50% labeled data ratio, the MAE was 5.570, which was 20.3% lower than that of MRC. This consistent advantage comes from the joint effect of the network architecture and the semi-supervised learning strategy. On the one hand, MSCE and AGLF provide feature representations suitable for dense bee counting scenes. On the other hand, the complementary density and count consistency losses, together with confidence masking and warm-up training, enable the model to effectively exploit unlabeled samples.
4.2. Analytical Interpretation of Module Contributions: Complementarity of MSCE and AGLF
The structural ablation experiment (Table 2) shows that the backbone-only model achieved an MAE of 7.064. Adding MSCE alone reduced the MAE to 6.455, a relative reduction of 8.6%. Adding AGLF alone reduced the MAE to 5.922, a relative reduction of 16.2%. When both modules were combined, the MAE fell to 5.201, a relative reduction of 26.4%. Paired Wilcoxon tests confirmed that the full model significantly outperformed all reduced variants (p < 0.05).
The improvement from MSCE can be attributed to its ability to capture sparse individuals, local clusters, and highly crowded regions simultaneously through atrous spatial pyramid pooling and multi-scale dilated convolutions. This ability is important because bee images usually show highly non-uniform density distributions. The improvement from AGLF is more pronounced because high-level semantic features tend to lose boundary and texture details after downsampling, while AGLF recovers spatial location cues of adjacent or partially occluded bees through coordinate attention and low-level feature fusion. It also suppresses false-positive background responses caused by honeycomb texture, wooden frame edges, and shadows. Therefore, MSCE and AGLF are complementary, and they jointly achieve multi-scale context modeling and low-level spatial detail restoration.
4.3. Mechanism of the Semi-Supervised Strategy: Complementarity of Consistency Losses and Stabilization
The semi-supervised ablation experiment (
Table 3, 50% labeled data ratio) further reveals the role of each component. Removing the semi-supervised loss entirely increased the MAE to 6.603, while the full framework reduced the MAE to 5.570, corresponding to a relative reduction of 15.6%. This result demonstrates that unlabeled samples can provide effective supplementary supervision through teacher–student consistency learning.
Using only count consistency or only density consistency resulted in MAE values of 6.801 and 7.192, respectively, and both were significantly higher than the joint version (5.570). This directly verifies their complementarity. Density consistency constrains the spatial distribution of the predicted density map, while count consistency constrains the global count. Together, they prevent the model from producing density maps that are visually plausible but numerically biased. Removing the confidence mask increased the MAE to 7.004, indicating that low-confidence pseudo-responses can introduce noise. Removing the warm-up strategy increased the MAE to 6.994, suggesting that imposing consistency loss too early may amplify errors when the teacher model is still unreliable. All the variants differed significantly from the full framework (p < 0.05), confirming the necessity of these designs for stabilizing teacher–student learning.
4.4. Generalization Ability and Labeling Efficiency
The cross-species generalization experiment on the Chinese honeybee dataset further evaluates the transferability of M3DANet under species and background shifts. Under the fully supervised setting, M3DANet achieved an MAE of 11.317 and an RMSE of 14.390, outperforming MCNN and TasselNetV2+ and approaching CSRNet, while maintaining a much smaller model size and higher inference efficiency. Under the semi-supervised settings, the advantage of M3DANet was more evident when labeled data were limited. With only 10% labeled training images, M3DANet achieved an MAE of 13.947 and an RMSE of 17.649, outperforming Dream, Calibrating, MTCP, and MRC. At the 30% labeled data ratio, it achieved the best semi-supervised performance, with an MAE of 11.945 and an RMSE of 14.078. These results indicate that the combination of multi-scale feature representation, attention-guided low-level fusion, and density and count consistency learning helps M3DANet exploit unlabeled images more effectively under cross-species domain shifts and reduces systematic underestimation.
From the labeling-efficiency perspective, the main-dataset results show that the MAE of M3DANet decreased from 9.937 to 7.003 and 5.570 as the labeled data ratio increased from 10% to 30% and 50%, while the RMSE decreased from 13.093 to 9.387 and 7.620. The improvement from 10% to 30% was larger than that from 30% to 50%, suggesting a diminishing-return trend as more labeled images were added. Notably, the 50% semi-supervised setting approached the fully supervised result, with only 0.369 higher MAE and 0.631 higher RMSE. Moreover, using only 50% labeled training images, M3DANet already achieved substantially lower counting errors than the fully supervised MCNN and TasselNetV2+ and approached the performance of CSRNet. These findings demonstrate that M3DANet can recover much of the performance that would otherwise require additional manual point annotations, providing a practical trade-off among annotation cost, counting accuracy, generalization ability, and deployment feasibility.
4.5. Limitations
Despite the encouraging results, several limitations remain.
- (1)
Sensitivity to extreme density conditions: As shown in the failure case analysis (
Section 3.7), adjacent bees in highly crowded and severely occluded regions may produce merged or weakened density responses, which can lead to under-counting. When the local density exceeds the maximum density range observed in the training set, the prediction bias may increase systematically.
- (2) Dataset constraints: Although the main dataset covers representative variations in bee density, shooting distance, illumination, viewing angle, and hive background conditions, its scale and environmental diversity remain limited compared with long-term real-world beekeeping conditions. In the Chinese honeybee generalization experiment, MRC performed similarly to M3DANet at the 50% labeled data ratio, indicating that cross-species generalization is still constrained by the coverage of training data.
- (3) Dependence on pseudo-label quality: The ablation experiments show that removing the confidence mask or the warm-up strategy significantly increased the MAE to approximately 7.0, indicating that pseudo-label noise can still affect the student model. When the labeled data ratio is extremely low or when the unlabeled data distribution differs greatly from the labeled data distribution, such noise becomes harder to suppress.
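To make the role of the confidence mask and the warm-up strategy in limitation (3) concrete, the following is a minimal sketch of how the two mechanisms can limit the influence of noisy pseudo-labels. It is not the implementation used in this study. The sigmoid-shaped ramp, the fixed threshold of 0.5, the count weight of 0.1, the choice to compute the count term over retained pixels only, and all function names are assumptions made for illustration; density maps are represented as flattened Python lists to keep the example dependency-free.

```python
import math


def warmup_weight(step, warmup_steps, max_weight=1.0):
    """Assumed sigmoid-shaped ramp: near 0 at step 0, max_weight after warm-up."""
    if warmup_steps <= 0:
        return max_weight
    t = min(max(step / warmup_steps, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)


def masked_consistency(student, teacher, confidence, threshold=0.5):
    """Density and count consistency over pixels the teacher is confident about.

    All three arguments are flattened per-pixel lists of equal length.
    Returns (density_term, count_term); both are 0.0 if no pixel is retained.
    """
    if not (len(student) == len(teacher) == len(confidence)):
        raise ValueError("inputs must have the same length")
    kept = [(s, t) for s, t, c in zip(student, teacher, confidence)
            if c >= threshold]
    if not kept:
        return 0.0, 0.0
    # Mean squared difference between student and teacher densities.
    density_term = sum((s - t) ** 2 for s, t in kept) / len(kept)
    # Absolute difference between the summed (counted) densities.
    count_term = abs(sum(s for s, _ in kept) - sum(t for _, t in kept))
    return density_term, count_term


def unsupervised_loss(student, teacher, confidence, step, warmup_steps,
                      count_weight=0.1):
    """Warm-up-scaled sum of the masked density and count consistency terms."""
    density_term, count_term = masked_consistency(student, teacher, confidence)
    weight = warmup_weight(step, warmup_steps)
    return weight * (density_term + count_weight * count_term)
```

In this sketch, low-confidence pixels never contribute to the unsupervised loss, and early in training the whole term is scaled close to zero, so unreliable teacher predictions have little effect on the student. When either mechanism is removed, every pseudo-label pixel contributes at full weight from the first step, which is one plausible reading of why the ablations show a higher MAE.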
Future work will focus on constructing larger and more diverse datasets, incorporating uncertainty-aware pseudo-label filtering, and designing adaptive inference strategies for extreme-density scenes.
5. Conclusions
In dense bee counting scenarios, manual point-level annotation is labor-intensive and deployment resources are often limited, making it challenging to balance counting accuracy, annotation efficiency, and practical deployability. To address this problem, this study proposes M3DANet, a lightweight semi-supervised bee counting method. The model is built on a MobileNetV3-Large backbone and incorporates a multi-scale context encoding (MSCE) module to capture density variations from sparse individuals to highly crowded clusters, as well as an attention-guided low-level fusion (AGLF) module to recover fine spatial details and suppress background interference. For semi-supervised learning, a teacher–student framework with density consistency, count consistency, confidence masking, and a warm-up strategy is employed to effectively exploit unlabeled images under limited annotation budgets.
Extensive experiments demonstrate that M3DANet consistently outperforms representative semi-supervised methods, including DREAM, Calibrating, MTCP, and MRC, at labeled ratios of 10%, 30%, and 50%. For example, at 10% labeling, M3DANet achieves an MAE of 9.937, which is 14.3% lower than DREAM, and at 50% labeling, the MAE is 5.570, which is 20.3% lower than MRC. Under the fully supervised setting, M3DANet attains an MAE of 5.201 and an RMSE of 6.989 with only 2.095 M parameters and an inference speed of 416.64 FPS, achieving a strong accuracy–efficiency trade-off compared with heavier models such as CSRNet and more compact ones such as MCNN and TasselNetV2+. Edge deployment on a Jetson Orin NX device further confirms its real-time counting capability and engineering feasibility in real beekeeping environments.
Overall, M3DANet provides an effective solution for automated bee counting that jointly considers counting accuracy, annotation efficiency, and deployability. Future work will focus on expanding data diversity across seasons, hive types, and bee species, incorporating temporal modeling from video sequences, improving semi-supervised pseudo-label filtering with uncertainty estimation, and conducting long-term stability evaluation under continuous outdoor operation.